
⚙️ The challenge of benchmarks and humanity’s final exam

Good morning, and happy almost weekend!

We’re still in January, and already, our prediction about agents is looking pretty good. Yesterday, two more agents swelled the ranks of these automated task-doers.

So, the developers have the agents. I’m far more curious to see if people will actually use them. (Let me know your thoughts below!)

— Ian Krietzberg, Editor-in-Chief, The Deep View

In today’s newsletter:

  • 🔥 AI for Good: An eye on volcanoes

  • 👁️‍🗨️ The strong ties between Microsoft and the Israeli military

  • 🤖 Perplexity drops an AI assistant (and so does OpenAI)

  • 📊 The challenge of benchmarks and humanity’s final exam

AI for Good: An eye on volcanoes

Source: Unsplash

Among the projects in the works at the AI group at NASA’s Jet Propulsion Laboratory is something called Volcano Sensorweb, a project that combines sensors and AI-enabled satellites to autonomously monitor volcanoes around the world. 

The details: The challenge here is allocating scarce high-resolution imaging resources. NASA operates two satellites that fly over potentially volcanic regions four times per day, but these satellites only capture moderate-resolution imagery.

  • So, the data gathered by those satellites is streamed live to the Goddard Space Flight Center, where it is processed by an AI algorithm designed to look for volcanic hot spots. 

  • If that software detects any hot spots, it automatically sends an observation request to a separate satellite, which processes that request using an onboard AI algorithm to properly orient itself before gathering high-resolution data of the area in question. 

This all occurs in the span of a few hours. 
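To make the flow concrete, here is a minimal Python sketch of that detect-then-task loop. It is an illustration only: the function names, data structures and detection threshold are assumptions for the sake of the example, not NASA’s actual code.

```python
# Hypothetical sketch of the Volcano Sensorweb tasking loop described above.
# Function names, thresholds and data structures are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Observation:
    region: str             # e.g. a named volcano
    thermal_signal: float   # relative thermal-anomaly score from moderate-res imagery


HOT_SPOT_THRESHOLD = 0.8    # assumed detection threshold


def detect_hot_spots(observations: list[Observation]) -> list[Observation]:
    """Ground-side step: flag regions whose thermal signal suggests volcanic activity."""
    return [obs for obs in observations if obs.thermal_signal >= HOT_SPOT_THRESHOLD]


def request_high_res_followup(obs: Observation) -> dict:
    """Build an observation request for the high-resolution satellite to act on."""
    return {"target": obs.region, "priority": "high", "reason": "volcanic hot spot"}


# Simulated daily pass: moderate-resolution data streams in, hot spots trigger tasking.
daily_pass = [Observation("Etna", 0.91), Observation("Kilauea", 0.42)]
for hot_spot in detect_hot_spots(daily_pass):
    print(request_high_res_followup(hot_spot))
```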

The project, which has been running for around 20 years, has the simple goal of keeping the planet’s 50 most active volcanoes under close observation. 

Start speaking a new language in the new year

Lapsed goals? Forgotten resolutions? That’s so 2024.

Find your follow-through with Babbel, the language-learning app designed to teach real-world conversations for any situation.

Whether you're:

  • Studying abroad

  • Dreaming of future travel, or

  • Looking to learn something new this year

Now’s the perfect time to start speaking a new language! From Spanish and French to Italian, Babbel focuses on learning through speaking—so you’ll be ready for real-life conversations.

With just 10 minutes a day, you could start having actual conversations in just 3 weeks.

🎉 Special Offer: Get 55% Off! 🎉
Right now, Babbel is offering an exclusive 55% discount for readers of The Deep View.

The strong ties between Microsoft and the Israeli military

Source: Unsplash

Just a day after the Washington Post reported on some of the previously unknown details regarding the relationship between Google and the Israeli military, The Guardian, in collaboration with several other outlets, reported — similarly based on leaked documents — that Israel’s military increasingly relied on Microsoft’s cloud and AI tech in the wake of the Oct. 7, 2023, attack. 

The details: This relationship reportedly involved a minimum of $10 million worth of deals enabling Microsoft to provide roughly 17,000 hours of technical support to several units — including the elite intelligence division, Unit 8200 — within the Israeli military. 

  • Beyond cloud storage, Microsoft reportedly provided the Israeli military with “large-scale access” to OpenAI’s GPT-4, the large language model behind ChatGPT. This was done through Microsoft Azure, rather than being provided directly by OpenAI. 

  • Similar to the exposed documents concerning Google’s involvement with the IDF, it remains unclear exactly how these AI systems were actually used by the Israeli military. The documents reportedly suggest that these systems were used to aid in translation efforts and speech-to-text conversion. 

Reportedly, a “significant” portion of the AI tech purchased by the Israeli military was deployed in “air-gapped” systems, referring to systems not connected to the internet or other public networks. Such an approach is generally undertaken to protect highly sensitive information. 

OpenAI told The Guardian that it does not “have a partnership with the IDF.” The startup quietly changed the language in its usage policies last year, deleting a prior restriction against the use of its tech for “military and warfare” activities. 

The increasing visibility of generative AI’s deployment in global warfare has raised ethical and moral concerns, including those based on the technical reliability of these systems, the potential for overreliance on faulty tech and the danger inherent in automated decision-making in war. South Korea last year worked to establish an international blueprint for the responsible use of AI in the military, which was endorsed by around 60 countries, including the U.S., whose military forces have been leveraging AI tech for years. 

  • OpenAI, SoftBank each commit $19 billion to Project Stargate (The Information).

  • TikTok users allege censorship, altered algorithms after Trump saved platform (Semafor).

  • Why Amazon is struggling to crack Argentina (Rest of World).

  • Trump signs executive order promoting crypto, paving way for digital asset stockpile (CNBC).

  • Bill Gates’ nuclear energy startup inks new data center deal (The Verge).

If you want to get in front of an audience of 200,000+ developers, business leaders and tech enthusiasts, get in touch with us here.

Discover Creative Talent That Drives Results

Athyna connects you with top-tier talent from Latin America, making hiring easy and cost-effective.

  • Talent with experience at companies like MediaMonks, R/GA, and Ogilvy.

  • Save up to 70% on salaries without compromising quality.

  • Enjoy a $1,000 discount exclusively for readers of The Deep View.

  • A new startup, Laina, came out of stealth Thursday with $10 million in funding; it offers automatic tracking and navigation of AI use across groups and teams at corporations working to maintain data security.

  • Sam Altman announced that the free tier of ChatGPT will be getting access to o3-mini, a move that comes shortly after the release of DeepSeek’s R1 model.

Perplexity drops an AI assistant (and so does OpenAI)

Source: Perplexity

In its latest challenge to Google, the AI search startup Perplexity on Thursday unveiled what it is calling the Perplexity Assistant, a multi-modal generative AI system designed to do more than just search. 

The details: In a series of promotional videos shared by Perplexity, the assistant — which appears to function much like Siri — is shown completing a number of tasks, from booking dinner and searching for a picture of a book to summarizing text on a phone screen and setting reminders. 

  • Perplexity says that the assistant will browse the internet — performing tasks and using tools as needed — on behalf of the user, something that could speed up bookings and purchases in general. 

  • It’s currently available on the Google Play Store; it’s not clear when or if the assistant will be coming to iOS. 

It’s also unclear how user data will be stored, processed, used or protected by Perplexity in the operation of its assistant. It is likewise unclear what the energy intensity or carbon footprint of this assistant is, or how it differs from Perplexity’s normal search platform. 

Perplexity did not respond to a request for comment regarding the above points. 

It might not meet your definition of one — and it is presumably relatively brittle — but the ‘agent’ component of our 2025 predictions is certainly bearing fruit. 

At the same time, OpenAI released a research preview of something called ‘Operator,’ a new AI “agent” that can “go to the web to perform tasks for you.” Operator is currently only available to Pro users in the U.S., though OpenAI said it will eventually expand to other tiers. The system, beyond its similarities to Perplexity’s assistant, is remarkably similar to Anthropic’s “Computer Use” functionality.

The challenge of benchmarks and humanity’s final exam

Source: Created with AI by The Deep View

If your only marker for LLM capabilities is the benchmarks that attempt to measure their capacity, then it would certainly seem as though generative AI models are becoming quite capable. 

Leading generative models, including GPT-4o, OpenAI’s o1, Gemini 1.5 and Claude 3.5 Sonnet, have all achieved scores of around (or above) 90% on MMLU, a widely used benchmark intended to assess general knowledge capabilities in LLMs, and have scored highly on math and other benchmarks. 

Indeed, OpenAI’s o3 model allegedly set new records on all these benchmarks, though the model and its benchmark results lack independent evaluation, transparency and peer review. That said, it certainly seems powerful, at least relative to the competition.  

This is at the core of an ongoing debate within the field over the actual capacity of LLMs, as compared to the illusion of that capacity. 

As cognitive scientist Dr. Melanie Mitchell put it last year: “It’s true that systems like GPT-4 and o1 have excelled on ‘reasoning’ benchmarks, but is that because they are actually doing this kind of abstract reasoning? Many people have raised another possible explanation: the reasoning tasks on these benchmarks are similar (or sometimes identical) to ones that were in the model’s training data, and the model has memorized solution patterns that can be adapted to particular problems.”

  • Some research seems to support the latter explanation, which is perhaps bolstered by the fact that these models tend to exhibit exceedingly uneven performance: they might post a top score on one benchmark, then fail an outrageously easy task moments later, something that points toward a lack of generalizability.

  • And an ongoing lack of transparency regarding the training data of major models from prominent developers makes this kind of evaluation almost impossible. 

All that to say, benchmarks are not necessarily indicative of reasoning capability. And further, as Dr. Gary Marcus pointed out during OpenAI’s o3 unveiling, these benchmarks do not address hallucinations or algorithmic bias, which could impact efficacy far more than performance on a test. 

But that is not stopping Dan Hendrycks from trying his hand at a new benchmark, rather dramatically named “Humanity’s Last Exam.” According to the New York Times, he was originally going to call it “Humanity’s Last Stand,” but scrapped it for being overly dramatic. So at least he’s somewhat self-aware. 

The details: Hendrycks, a well-known AI safety researcher and the director of the Center for AI Safety, worked with Scale AI (a company he advises) to develop a benchmark consisting of 3,000 challenging short-answer and multiple-choice questions designed to stump current LLMs. 

  • To develop the benchmark, the Center for AI Safety fielded submissions from over 1,000 subject matter experts around the world. The team first ran these questions through leading generative AI systems; all the questions that stumped the models were then passed on to a team of human experts for refinement and verification. 

  • Question submitters were paid between $500 and $5,000 per question.

The final list of questions was then fed to six leading models, including Gemini 1.5 Pro, Claude 3.5 Sonnet and OpenAI’s o1. All of the models performed abysmally, with o1 achieving a leading score of 9.1%. 
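For readers who think in code, here is a minimal sketch of that filter-then-score process. It is a rough illustration under assumptions: the ask_model() helper, the question format and the scoring details are hypothetical placeholders, not the Center’s or Scale AI’s actual pipeline.

```python
# Sketch of the two-stage benchmark construction and scoring described above.
# ask_model() and the question/answer format are hypothetical placeholders.

def ask_model(model: str, question: str) -> str:
    """Placeholder for querying an LLM; in practice this would call a model API."""
    ...


def filter_stumpers(questions: list[dict], frontier_models: list[str]) -> list[dict]:
    """Stage 1: keep only questions every current model gets wrong,
    before they are passed to human experts for refinement and verification."""
    return [
        q for q in questions
        if all(ask_model(m, q["text"]) != q["answer"] for m in frontier_models)
    ]


def score(model: str, benchmark: list[dict]) -> float:
    """Stage 2: simple accuracy of a single model on the finished benchmark."""
    correct = sum(ask_model(model, q["text"]) == q["answer"] for q in benchmark)
    return correct / len(benchmark)
```

Under this framing, o1’s reported 9.1% is just the accuracy figure that a score()-style calculation would produce over the final question set.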

Still, the Center said in a paper associated with the public release of the dataset that it expects the benchmark to quickly become highly saturated, and that it wouldn’t be surprised if models surpass 50% on the benchmark by the end of 2025. 

The paper noted that the benchmark does not assess “open-ended research or creative problem-solving abilities, making it a focused measure of technical knowledge and reasoning.”

Without assessing the training data, these benchmarks appear largely meaningless. Here’s a sample question posted on the benchmark’s website: “In Greek mythology, who was Jason's maternal great-grandfather?” 

I’d be surprised if the answer to that question was not already in the training sets for all of the models on trial here. And I’d be even more surprised if the developers don’t train the next versions of these models using this benchmark, in order to beat it. 

Benchmarks do not equate to efficacy or usability. And they certainly don’t equate to intelligence. I find no compelling reason that any benchmark could possibly indicate true generalizability or reasoning capabilities, since training data remains opaque. 

And, I have to say, while Hendrycks seems aware enough of dramatic phrases to switch from “Humanity’s Last Stand” to “Humanity’s Last Exam,” the new version is almost as dramatic as the first one. 

And that’s just unnecessary. 

Which image is real?


🤔 Your thought process:

Selected Image 2 (Left):

  • “I'm not positive but I think the statue in Image 1 is facing the wrong direction...”

💭 A poll before you go

Thanks for reading today’s edition of The Deep View!

We’ll see you in the next one.

Here’s your view on Project Stargate:

37% of you do not trust the people involved with this venture.

21% are feeling bad about it and 14% think it’s great.

Bad:

  • “ … lots of cool things are happening. I'm happy to see it. But unless these companies can show how AI is 'actually' making life better for all of us versus just the ones who can afford to invest in tech, this effort will hit a wall too and stall. In the meantime, the cloud companies will continue to profit until they too have to shift emphasis to the benefit of the wider population needs.”

Bad:

  • “No regulation in place before pushing forward is insane.”

Could you see yourself actually using these agents (Perplexity Assistant, Operator, etc) for online bookings?
