⚙️ OpenAI unveils powerful new model
Good morning. OpenAI unveiled its next foundation model, o3, last week, in a reveal that put the spotlight firmly on the company.
The model spent the weekend at the center of a raging debate among researchers on Twitter over its actual capabilities, and therefore its actual impact.
But the release raises more questions than it answers. We get into it all below.
— Ian Krietzberg, Editor-in-Chief, The Deep View
In today’s newsletter:
👁️🗨️ AI for Good: Landmine detection
💻 Researchers unveil the Genesis project
📊 OpenAI unveils powerful new o3 model
AI for Good: Landmine detection
Source: Unsplash
The use of landmines in conflicts around the world has an enormous downstream impact. Demining land for future civilian use first requires detecting unexploded devices, a process that can take decades.
The scope of the crisis has inspired evaluations of different technologies that might speed things along.
What happened: Earlier this year, a small team of researchers designed an artificial intelligence-based system that, when combined with a camera-equipped robot, is capable of rapid landmine detection.
The AI system — specifically, a deep learning algorithm — was designed specifically as a lightweight solution, meaning it can be run on the same iPhone being used to control the robot (all from a safe distance).
It was trained and fine-tuned on images and videos of landmines in various conditions and environments. According to the study, the system demonstrated a high efficacy rate in detecting two types of landmines: 97.69% for “butterfly” mines and 99.4% for “starfish” mines.
The researchers intend to continue their work here, specifically focusing on enhancing the model’s robustness in dramatically different environments.
Never leave your business on hold: Revolutionize phone calls with Thoughtly
Generative AI is transforming the call center industry, and Thoughtly is at the forefront of this revolution.
Thoughtly’s advanced AI agents can handle phone calls in any language, delivering seamless, natural-sounding conversations 24/7. Your team can now focus on the work that matters most—while Thoughtly’s AI keeps your call center running smoothly.
Introducing Automations
Our latest feature, Automations, takes integration to the next level by allowing you to connect Thoughtly directly with any CRM. Sync data effortlessly, automate follow-ups, and create a fully connected ecosystem for sales, support, and beyond.
✨ Why Thoughtly?
Streamlined Efficiency: From sales to support, optimize every interaction across channels.
Seamless Integration: Works perfectly with your existing tech stack—including CRMs via Automations.
Real Insights: Access a comprehensive analytics suite for actionable feedback and improvement.
Researchers unveil real-world simulator: ‘The Genesis Project’
Source: The Genesis Project
Last week, a massive team of researchers unveiled something called “the Genesis Project,” the result of a 24-month research collaboration among 20 different labs, designed specifically for advanced robotics.
The details: According to the team, Genesis is a “universal physics engine” capable of simulating a range of physical objects and environments, a lightweight robotic simulation platform, a photo-realistic rendering system and a generative engine, all wrapped up in one.
Unlike traditional generative AI models, Genesis seems to represent an integrated system of multiple components: a physics engine at the core, powered in part by a vision language model (VLM), with a generative framework layered on top. The team is open-sourcing the underlying physics engine and said it plans to openly release the generative component soon.
Genesis achieves this accurate reconstruction of physical environments, according to the researchers, by integrating “a wide spectrum of state-of-the-art physics solvers, allowing simulation of the whole physical world in a virtual realm with the highest realism.”
It’s not yet clear what the Genesis system will actually be used for, though it seems to overcome the traditional physics-based limitations of existing image and video generators.
Still, we won’t know much about its impact until the whole thing is released into the hands of researchers, where it will be tested, pushed and verified.
Italy’s data privacy watchdog fined OpenAI $15.58 million last week, the result of a 2023 investigation into the startup that found that OpenAI “processed users' personal data to train ChatGPT without first identifying an adequate legal basis.” OpenAI said it will appeal the ruling.
Google introduced Gemini 2.0 Flash Thinking last week, a model that, similar to OpenAI’s “o” series, employs Chain of Thought “reasoning” to better solve complex problems.
Exclusive-US data-center power use could nearly triple by 2028, DOE-backed report says (Reuters).
The ghosts in the machine (Harper’s Magazine).
Elon Musk endorses far-right Alternative for Germany party in upcoming election (CNBC).
Congress steers away from a damaging shutdown (Semafor).
It’s AI! It’s Crypto! Crypto meets AI (The Information).
If you want to get in front of an audience of 200,000+ developers, business leaders and tech enthusiasts, get in touch with us here.
OpenAI unveils powerful new o3 model
Source: OpenAI
On the last day of its “12 days of OpenAI” series (which, until this point, had only really notably seen the wider launch of Sora), OpenAI unveiled its next foundation model family, o3, the successor to the o1 “reasoning” model it released earlier this year. Both models feature private “Chain of Thought” reasoning, in which a model works through a problem step by step, a brute-force mimicry of human reasoning.
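To illustrate the idea with a made-up example (not OpenAI’s actual output): asked something like “what is 17 x 24?”, a chain-of-thought model first writes out intermediate steps (17 x 20 = 340, 17 x 4 = 68, 340 + 68 = 408) and only then states the answer, rather than emitting “408” directly.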
As with o1, OpenAI has a few versions of o3: there’s the standard model, and there’s o3 mini, which can be run at different levels of inference-time compute.
Now, this was an unveiling, not a launch. CEO Sam Altman said during a livestream that the company intends to launch o3 mini by the end of January, and o3 “shortly after that,” but it is quite unclear what hurdles remain between now and an actual launch of the product.
OpenAI researchers said that o3 is currently undergoing internal safety testing and interventions; in a bit of an uncharacteristic move for OpenAI, the company opened up applications for external safety testing of the mini version of o3. It’s not clear how long this process will take, or what kinds of safety risks are actually posed by o3; if the improvements are as significant as OpenAI suggested, the model might exceed the risk thresholds of OpenAI’s own safety framework for release.
The details: I’m going to preface all of this by saying two things: one, OpenAI’s lack of transparency around what it’s actually building persisted with this announcement (we don’t know anything about the model), and two, little of what OpenAI presented has actually been independently verified. A demo is just a demo.
But, according to internal OpenAI research, o3 smashes through benchmarks on coding, software engineering and math: on competition code (Codeforces), o3 scored a 2727 to o1’s 1891. On software engineering (SWE-bench Verified), o3 scored a 71.7 to o1’s 48.9. And on FrontierMath, a collection of some of the hardest math problems in existence, o3 boosted the state-of-the-art accuracy rate from 2% to 25.2%. On other benchmarks, o3 mini performed on par with or below o1 and GPT-4o.
While undoubtedly impressive on its face, we don’t know what the model was trained on — or even what the model itself is — to achieve those results, making it impossible to make any hard statements about o3’s capabilities. Until OpenAI increases transparency on these points — and they won’t — true evaluation will be almost impossible.
As AI researcher Chomba Bupe wrote, “they most likely have a ton of engineered scaffolding code to hold them together under the hood. If they were truly building a learning & adapting system they would demo how the model learns on the fly to solve totally new problems rather than making the model plagiarize a lesser-known problem, that someone already solved, to make it look good in public eyes.”
ARC-AGI: In 2019, computer scientist François Chollet published a paper that introduced an AGI benchmark, quantifying just how hard it would be to get AI models to a point of general intelligence. AGI, or artificial general intelligence, refers to a hypothetical evolution of AI that would feature human-like intelligence; many researchers do not believe it is achievable, in part because the field has never managed to define what AGI might even be in the first place (or what, exactly, human intelligence itself is).
Chollet has since published the ARC-AGI-1 benchmark, with the goal of directing research specifically toward AGI and away from traditional benchmarks. The benchmark features problems that are easy for humans to solve but very challenging for AI models. OpenAI’s o3, as verified by ARC, scored 75.7% on ARC’s AGI benchmark.
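For context, each ARC task presents a few example pairs of small colored grids, an input and its transformed output, and asks the solver to infer the underlying transformation rule and apply it to a new input grid; most people can do this quickly, while AI systems have historically struggled.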
Importantly, the model was trained on the benchmark. ARC noted that o3 was trained on “75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.”
While this is common in machine learning, the problem (which was at the core of a Twitter battle over the weekend) is that the truly significant measure of a possible AGI is performance on problems a model was not trained on; there was a lot of confusion from OpenAI regarding the reality of that situation, with Altman saying during the livestream that the benchmark wasn’t “targeted,” despite OpenAI having used the training set.
Dumitru Erhan, Google DeepMind’s research director, called out the lack of scientific rigor surrounding the release, saying: “Why is everyone so intellectually uncurious, especially the challenge organizers, about the mechanisms by which this impressive submission that obliterates everything else out there works? Can we as a community not express some genuine scientific skepticism?”
A higher-compute configuration of o3 (172x) scored an 87.5% on the ARC benchmark after 16 hours. (85% was the threshold to consider the benchmark beaten, but Chollet said that, until a highly efficient, open-source model can score 85%, he will continue to run the ARC-AGI prize; 172x compute is not, after all, remotely efficient.)
The cost in compute of achieving the first score was $2,000, or $20 per task; OpenAI requested that ARC not make public the cost associated with the 172x increase in compute required for the 87.5% score.
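For rough context: at $20 per task, the $2,000 figure works out to roughly 100 evaluated tasks, and a naive linear scaling of the 172x configuration would put its cost somewhere in the hundreds of thousands of dollars, though that is only an estimate; the actual number remains undisclosed.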
Chollet said that, despite the cost, this is a “genuine breakthrough.” But he said that a high score on ARC-AGI doesn’t mean AGI has been achieved; it’s only one benchmark, and it has become highly saturated. He plans to launch a second version of the ARC-AGI benchmark soon.
“I don't think o3 is AGI yet,” he wrote. “o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence,” adding that he expects o3 to struggle mightily with ARC-AGI-2. “You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”
The other element to this is that success on a benchmark does not address other limitations of the architecture; as Dr. Gary Marcus pointed out, Altman did not address issues of hallucination or algorithmic bias during the livestream whatsoever, and “was very light on applications outside of benchmarks in closed domains.”
The announcement also didn’t mention the 2024 ARC leaderboard, whose previous high score — at 53.5% — was achieved using Anthropic’s Claude Sonnet 3.5. Jeremy Berman explains how he got Sonnet to a 53.5% on the benchmark in this illuminating post.
The current ARC-AGI leaderboard.
I remain skeptical of internal benchmark numbers released without transparency. The model — or system, we don’t know — was very likely trained on those benchmarks, meaning that high scores are not necessarily indicative of much.
That said, this seems like a significant increase in the realm of the possible, judging by the fact that previous models were also likely trained on those benchmarks, making the difference in benchmark scoring a good place to start in ascertaining model-to-model differences.
If that jump in benchmarking capability is generalizable across the model — which is highly, highly unlikely — then I would say that o3 should be considered a “high” risk model, by OpenAI’s own safety framework, and thus, shouldn’t be released. But OpenAI is a for-profit company. They don’t care about AGI; they care about making money.
It’s profitable to hype up the capabilities of their models. Their goal is, after all, to sell these products. So, either o3 isn’t actually that powerful/dangerous, and it’ll be released, or it is actually that powerful/dangerous, and they’ll probably release it anyway.
And here’s another problem with OpenAI and corporate research: there’s no transparency. We don’t know the cost, in electricity, carbon emissions, hardware or time, of building the model; we don’t know the cost in electricity and carbon emissions of operating the model at any level; we don’t know what the model was trained on; we don’t know anything about its internal architecture, which means researchers can’t know how powerful it actually is or isn’t. All we have is OpenAI’s word, and, as I mentioned, anything the company says must be taken with a grain (or two, or three) of salt, since the intention behind all of its actions and public interactions is to sell its products.
This is probably something. The clearest indication of its capability is the ARC-verified benchmark, but again, that’s just a benchmark. It’s just impossible to tell at this stage what o3 actually is, what impact o3 will actually have, if or when anyone will see it and whether it will ever be economically — and sustainably! — viable to deploy.
“Breakthroughs aren't just shouted out from the mountain tops,” Bupe wrote. “Scientific breakthroughs must be independently verified by a wider scientific community. As a product, I have no problems with the secrecy. But OpenAI trying to push their products as scientific research is what's wrong here.”
Which image is real?
🤔 Your thought process:
Selected Image 1 (Left):
“I've been there!”
Selected Image 2 (Right):
“Seemed like AI could not replicate the complex shadowing.”
💭 A poll before you go
Thanks for reading today’s edition of The Deep View!
We’ll see you in the next one.
Here’s your view on the AI trade in light of Micron’s stumble:
34% of you don’t think the AI trade is unwinding; 25% think it is.
Yes:
“If not now, soon …”
To release, or not release ... If o3 is as powerful as OpenAI says, should it be released?