
⚙️ ARC launches new AGI benchmark focused on efficiency

Good morning. OpenAI has upgraded its native image generation in ChatGPT; the upgrade's main calling card is better text rendering within generated images.

In early tests, it does appear far more capable than other systems on the text side of image generation. But it’s not nearly as clean as it was in OpenAI’s demo.

Surprise, surprise.

— Ian Krietzberg, Editor-in-Chief, The Deep View

In today’s newsletter:

  • 🎙️ Podcast: The five taboos that Silicon Valley broke

  • 🚘 Waymo is expanding to the East Coast

  • 🚁 NASA report: LLMs should not be adopted for critical safety assessments

  • 👁️‍🗨️ ARC launches new AGI benchmark focused on efficiency

🎙️ Podcast: The five taboos that Silicon Valley broke

I had the pleasure of connecting with Igor Jablokov, the founder and chairman of Pryon, to talk about the ways in which the field of AI has evolved over time. 

Igor worked as a program director at IBM, where he developed an early version of IBM Watson, before striking out on his own. His first startup was acquired by Amazon, where it evolved into Alexa. 

He talks about a time when the AI field was a small, overlooked and underfunded group, one dedicated to bettering human life through technology. That’s not so much the case anymore; AI has become The Thing, and as excitement around it has blossomed, so too has a combination of blistering hype and hundreds of billions of dollars’ worth of investment. 

The impetus behind this change, in Igor’s view, is that Silicon Valley broke several fundamental taboos within the industry: trawling and downloading internet data without a care for copyright, releasing flawed and faulty systems, compromising user data and painting a story of impending synthetic life, which has led to over-reliance on those flawed and faulty systems. 

And that’s just the beginning — for his thoughts on general intelligence (which can’t be brute-forced!) and the ways in which AI is “like an opera,” you’ll want to check this one out.

Is AI Actually Making Test Automation Faster? The Data Might Surprise You

Rainforest QA surveyed 625 developers, engineers, and tech leaders to uncover what’s working (and what’s not) in test automation.

In this report, you’ll learn:

  • 81% of teams now use generative AI in software testing—but has it delivered real speed improvements?

  • The surprising truth about AI’s impact on test creation and maintenance time.

  • Which automation tools and strategies consistently improve testing efficiency?

  • How smaller teams approach test automation differently from larger organizations.

  • When teams make the switch from manual to automated testing—and why.

  • Key challenges teams face when integrating AI into their testing workflows and how to overcome them.

Unlock the data-backed insights that can help you streamline testing and deliver faster. Don’t wait—download the full report today!

Waymo is expanding to the East Coast

Source: Waymo

Self-driving firm Waymo said Tuesday that it is sending its autonomous vehicles back to Washington D.C. to lay the groundwork for a full launch of its autonomous ride-hail service. 

Waymo aims to launch the service in 2026. 

The details: But getting a self-driving ride-hail service set up in D.C. won’t be quite as straightforward as it is in, say, Texas. D.C. law currently requires a human to be behind the wheel during self-driving tests, and further prohibits driverless commercial operations, according to the District Department of Transportation (DDOT).

  • Waymo acknowledged this in its statement, saying that it plans to “work closely with policymakers to formalize the regulations needed to operate without a human behind the wheel.”

  • A DDOT spokesperson told me that the Department is aware of Waymo’s announcement, adding that it is “actively developing a permitting framework to support the safe and responsible testing of autonomous vehicles in Washington, D.C.”

The spokesperson said that the Department is engaged in an “iterative rule-making process” with a focus on “public input, emerging best practices, and lessons learned from peer jurisdictions. Our priority remains ensuring that any company operating in the District — such as Waymo — does so in a manner that prioritizes safety, aligns with our regulatory framework and integrates seamlessly into DC’s unique transportation ecosystem.”

Waymo said that it currently delivers 200,000 rides per week across an active fleet of more than 700 vehicles (300 in San Francisco, 200 in Phoenix, 100 in Los Angeles and a “limited number” in Austin). The average duration or distance of each ride and the number and type of interventions made by remote human operators remain unknown. 

Responding to a request for comment, Waymo told me that it doesn’t share details regarding human interventions or average ride distance/duration.

The self-driving firm currently operates in San Francisco, Phoenix, Los Angeles and Austin, with plans to expand to the East Coast next, beginning with Atlanta, Miami, and then, Washington D.C. 

Why it's a big deal: There’s a reason Waymo started on the West Coast: the weather is far more conducive to the sensors and cameras that make its robocars tick. The East Coast, with its varied weather patterns and all that rain, presents a much greater challenge and a much more significant risk. Washington D.C., meanwhile, marks the first city Waymo is driving toward that gets snow, a whole new challenge. 

Each Waymo comes laden with cameras, lidar and radar, three technologies tied together with neural networks. But lidar performance is known to degrade in rain, and cameras struggle in low-visibility conditions (think raindrops or fog on the lens). Waymo says that its radar system works well even in challenging weather conditions, but, as self-driving expert Dr. Missy Cummings told me last year, self-driving cars are just not suited for every environment: “we need to think more about risk mitigation and put them in the domains that make sense.”
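For intuition only, here’s a toy sketch (in Python) of weather-dependent sensor weighting. The weights, condition labels and function below are hypothetical, meant to illustrate the trade-off described above; they do not reflect Waymo’s actual neural-network fusion stack.

# Hypothetical illustration: down-weight sensors in the conditions that degrade them.
WEATHER_WEIGHTS = {
    # condition: (camera, lidar, radar) weights
    "clear": (1.0, 1.0, 1.0),
    "rain":  (0.6, 0.5, 0.9),   # raindrops on lenses, degraded lidar returns
    "fog":   (0.4, 0.6, 0.9),   # low visibility hits cameras hardest
    "snow":  (0.5, 0.4, 0.8),   # the new challenge D.C. would add
}

def fuse_confidence(camera: float, lidar: float, radar: float, weather: str) -> float:
    """Weighted average of per-sensor detection confidences for one object."""
    weights = WEATHER_WEIGHTS[weather]
    scores = (camera, lidar, radar)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

print(fuse_confidence(0.9, 0.85, 0.8, "clear"))  # 0.85
print(fuse_confidence(0.9, 0.85, 0.8, "snow"))   # ~0.84, leaning on radar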

Enterprise AI Doesn’t Have to Be Complicated

Sana Agents empower every department with AI that actually understands your business.

Deploy secure, no-code agents that connect your data, automate workflows, and delight users—that's enterprise AI done right.

  • The model race: DeepSeek released a major upgrade to its V3 language model, boasting major benchmark improvements compared to its predecessor. Google, meanwhile, unveiled Gemini 2.5, a next-gen family of ‘reasoning’ models.

  • ChatGPT’s imagery upgrade: OpenAI in a Tuesday live stream launched an upgrade of native image generation — which includes image editing — in its 4o model. Sam Altman called it a “huge step forward.” In the demo, the model seems capable of generating clean text in its image output.

  • Delete your DNA from 23andMe right now (Washington Post).

  • Apple says it’ll use Apple Maps Look Around photos to train AI (The Verge).

  • Napster pioneered music sharing over 25 years ago. It just got bought for $207 million (CNBC).

  • How to get computers — before computers get you (Wired).

  • Consumer confidence in where the economy is headed hits 12-year low (CNBC).

NASA report: LLMs should not be adopted for critical safety assessments

Source: Unsplash

Increasingly, researchers and regulators, exploring the opportunities afforded by the advent of large language models (LLMs), have suggested applying LLMs to safety-critical assessments. 

The Federal Aviation Administration (FAA), for example, is exploring integrations of LLMs for the safety certification of aircraft. 

What happened: A recent NASA report analyzed the proposed approaches, finding a concerning lack of evidence that LLMs are fit for such applications. 

  • The problem here comes back to LLMs’ propensity to confidently produce incorrect information (or hallucinate), a phenomenon the researchers describe as “Frankfurtian BS,” since LLMs generate “an indiscriminate mixture of truth and falsehood” and have “no awareness of truth or intent to mislead.” 

  • In low-risk applications, they write, this isn’t as much of a problem. But in high-risk applications — such as, for example, certifying that an aircraft is safe to fly — anyone using an LLM must thoroughly and carefully review any output, working to discern and separate actual truth from the BS, a time-consuming and costly task. 

“The hard part of writing or reviewing an assurance argument is not the typing or the reading, it is the thinking. But LLMs don’t think, they BS, thereby creating a need for careful review of their output,” according to the report. “As long as there remains any human need to review the LLM’s output for veracity, thinking is precisely the part of the writing or reviewing task that cannot be automated away. If an LLM generates an evidentiary premise, but a human must check development artifacts to be sure that it is correct … there is little to be saved but a small amount of typing.”

Much more research needs to be done to provide evidence that LLMs are actually fit for such applications, the researchers write. Until that evidence is on hand, LLMs should be considered an “experimental” technology. 

“We should not adopt technology into the regulatory pipeline without acceptable reasons to believe that it is fit for use in the critical activities of safety engineering and certification,” they write. “LLMs are machines that BS, not machines that think, and thinking is precisely the task that must be automated if the technology is to improve safety or lower cost.”

ARC launches new AGI benchmark focused on efficiency 

Source: Unsplash

When it comes to indicating legitimate capabilities and real-world performance, benchmarks — though wildly popular — tend not to accomplish much. 

Several years ago, computer scientist François Chollet launched a benchmark — ARC-AGI-1 — intended to legitimately assess whether models are capable of reasoning. His goal is for Arc’s benchmarks to serve as a “North Star” toward artificial general intelligence, or AGI, that hypothetical, loosely defined and scientifically dubious evolution of AI that would possess human-adjacent intellect. 

The fundamental idea behind Arc’s benchmark is that its tasks are easy for humans but hard for AI systems. 

To Arc, AGI is “the gap between the set of tasks that are easy for humans and hard for AI. When this gap is zero, when there are no remaining tasks we can find that challenge AI, we will have achieved AGI.” 

The focus is on generalizability beyond a training set, something that’s hard to verify when developers keep their training data and processes under lock and key. 
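To make the setup concrete, here’s a rough sketch (in Python) of how an ARC-style task is laid out; the public ARC-AGI tasks are JSON files of small colored grids encoded as integers 0-9, though the specific puzzle and solver below are made up for illustration.

# A made-up ARC-style task: the hidden rule is "mirror each row horizontally."
# Trivial for a person; a solver must infer the rule from the demonstration
# pairs alone and then apply it to the test input.
task = {
    "train": [  # demonstration pairs the solver can study
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]},
    ],
    "test": [   # the solver must produce the output for this input
        {"input": [[5, 0, 0], [0, 5, 0]]},
    ],
}

def solve(grid):
    """Hand-written rule for this one toy task: reverse every row."""
    return [list(reversed(row)) for row in grid]

for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(solve(task["test"][0]["input"]))  # [[0, 0, 5], [0, 5, 0]]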

What happened: This week, Chollet launched ARC-AGI-2, a second iteration of his AGI benchmark that, unlike the first one, focuses not just on generalized intelligence but on efficiency as well.

“Unlike ARC-AGI-1, this new version is not easily brute-forced,” Chollet wrote.

  • Each of the tasks in the benchmark was solved in under two attempts by at least two humans, at a cost of $17 per task. But even o3 low, the OpenAI reasoning model that scored 75% on ARC-AGI-1 at an undisclosed cost, would only score 4% on the new benchmark, at a cost of $200 per task. 

  • Pure language models scored 0%, and single chain-of-thought models like DeepSeek’s R1 landed below 0.5%. 

“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc wrote. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component.”

So the cost per task will be an integral part of the ARC-AGI-2 benchmark; contestants for the Arc Prize only get $50 of compute per submission, and must open-source their solutions in order to receive the prize money. 
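For a rough sense of that efficiency gap, here’s a back-of-the-envelope calculation using the figures above. This is illustrative arithmetic only, not Arc’s official scoring formula, and it treats the human panel as solving essentially every task.

# Effective spend for each task actually solved: cost per attempt / accuracy.
def cost_per_solved_task(cost_per_task: float, accuracy: float) -> float:
    return cost_per_task / accuracy

human = cost_per_solved_task(17.0, 1.0)     # humans: ~$17 per solved task
o3_low = cost_per_solved_task(200.0, 0.04)  # o3 low at 4%: $5,000 per solved task

print(f"human: ${human:,.0f}   o3 low: ${o3_low:,.0f}")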

I like seeing the focus on efficiency. Intelligence, as I’ve said before, cannot be brute-forced. By that very nature, large language models are kind of barking up the wrong tree from the start. 

Still, transparency around the underlying models is lacking; since the training data is so vast, and since the details of the information within that training data remain obscured, it’s difficult to put any real weight behind benchmark performance, even on a benchmark like this. 

In its current, unsaturated form, this feels like a decent indicator of LLM limitations; despite being trained on an internet-scale corpus of information, a quantity of data that we mere mortals can hardly comprehend, these models exhibit a supposed ‘intelligence’ that remains weak, brittle and inefficient. 

Still, this affirmation is nothing new; the jagged edge of LLM capability, with its unpredictable, confident failures — Frankfurtian bullshit, more commonly recognized as hallucination — is one feature of these systems that has not gone away. 

Which image is real?

🤔 Your thought process:

Selected Image 2 (Left):

  • “Wtf is the saw even cutting in image 1?” The table?

Selected Image 1 (Right):

  • “I was sure Image 2 was way too clean to be real.”

💭 A poll before you go

Thanks for reading today’s edition of The Deep View!

We’ll see you in the next one.

Here’s your view on cyber attacks:

40% of you said your company has not yet been a victim of an AI-related cyber attack.

But a third of you said your company has.

Yes:

  • “Personal, cyber security home networks have never been more important.”

Not yet:

  • “I don't think so, but I work for the US Government and it's probably just a matter of time given everything going on right now.”

My East Coasters - you excited for Waymo in DC?

If you want to get in front of an audience of 450,000+ developers, business leaders and tech enthusiasts, get in touch with us here.