⚙️ Musk releases Grok 3, a slightly better version of the competition

Good morning. Former OpenAI CTO Mira Murati has launched her startup, Thinking Machines Lab. It’s not clear what exactly the lab will be doing, but it’s staffed by around 20 researchers formerly of OpenAI.
— Ian Krietzberg, Editor-in-Chief, The Deep View
In today’s newsletter:
🌊 AI for Good: Autonomous science in space
📊 Study: LLMs shouldn’t be used to replace human participants
👁️🗨️ Musk releases Grok 3, a slightly better version of the competition
🎙️ Podcast:
The latest episode of The Deep View: Conversations — an exploration of artificial companionship with Intuition Robotics CEO Dor Skuler — is out! Check it out below:
AI for Good: Autonomous science

Source: Unsplash
Science in space is hard.
Bethany Theiling, a planetary research scientist at NASA's Goddard Space Flight Center, is an ocean worlds geochemist; she, alongside research teams at NASA, studies oceans not just here on Earth, but across the solar system.
But studying the oceans that reside on other planets is a massive logistical challenge: data collection and transmission operate on a series of delays imposed by the communication lag between home base and a given spacecraft.
What happened: Theiling and her team are developing an artificial intelligence model designed to “act as a scientist aboard a spacecraft,” something that could enable the far more seamless gathering of interplanetary oceanic data.
The goal, according to Theiling, is “science autonomy … We want multiple instruments to be able to collect data on board, that the science agent can analyze and make decisions about, including returning this information to Earth. This includes prioritizing, transmitting and deciding where and when to take the next samples.”
Though she said a fully autonomous science ‘agent’ is still a few years out, such an achievement would enable autonomous, uncrewed spacecraft to eventually respond to real-time events. Today, the system can do simple tasks; Theiling’s goal involves much more complexity.
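The prioritization problem Theiling describes, deciding which observations are worth the limited downlink back to Earth, can be sketched in a few lines. This is a hypothetical illustration, not NASA's actual system (which isn't public): score each observation by novelty and greedily fill a transmission budget.

```python
# Hypothetical sketch of onboard science prioritization under a
# downlink budget; the Observation fields and scoring are assumptions
# for illustration, not NASA's actual design.
from dataclasses import dataclass

@dataclass
class Observation:
    instrument: str
    novelty: float      # how surprising vs. prior data, 0-1
    size_kb: int

def prioritize(obs: list[Observation], budget_kb: int) -> list[Observation]:
    """Greedy: transmit the most novel data that fits the downlink budget."""
    chosen, used = [], 0
    for o in sorted(obs, key=lambda o: o.novelty, reverse=True):
        if used + o.size_kb <= budget_kb:
            chosen.append(o)
            used += o.size_kb
    return chosen

queue = [
    Observation("mass_spec", novelty=0.9, size_kb=800),
    Observation("camera", novelty=0.2, size_kb=1200),
    Observation("seismometer", novelty=0.7, size_kb=300),
]
print([o.instrument for o in prioritize(queue, budget_kb=1200)])
# → ['mass_spec', 'seismometer']
```

The point of the "science agent" framing is that this scoring and the follow-up decision (where to sample next) happen onboard, without waiting out the round-trip communication delay.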
Why it matters: “Ultimately no person can be on these spacecraft,” she said. “We are trying to create an AI science agent to find ‘eureka moments’ in real time on its own. We are trying to create AI independence through multiple observations.”

Automation is evolving—here’s what you need to know
The automation technology market is converging, and a new category of integrated solutions has emerged.
BOAT (Business Orchestration and Automation Technologies) combines the power of process automation, RPA, and iPaaS into a single solution that can also help to operationalize AI.
Download Camunda’s BOAT guide to learn more about the market shift and evaluate if this is the right automation strategy for you.
Study: LLMs shouldn’t be used to replace human participants

Source: Unsplash
As language models have become more advanced, some researchers have begun to suggest that they can be used to, at least partially, replace human participants in social science studies and surveys.
But the more scientists look into it, the more we’re hearing that LLMs remain limited in ways that could severely impact results.
What happened: A team of computer science researchers, through a series of human studies with more than 3,000 participants, identified two significant limitations: “LLMs are likely to both misportray and flatten the representations of demographic groups.”
The team — which posed a series of demographic questions to humans, in addition to GenAI models — tied these limitations back to the data that makes these models run, saying that the “limitations will likely persist so long as LLMs are trained on the current format of online text.” They argued that, because of this, “these limitations cannot be easily resolved by newer models.”
The team suggested that researchers should exercise a lot of caution if attempting to use LLMs to replace human participants, especially when peoples’ identities are relevant to the research being undertaken.
Why it matters: “Our results differ from work that affirmatively shows LLMs can simulate human participants,” lead author Angelina Wang said. “We test if LLMs can match the distribution of human responses — not just the mean — and use more realistic free responses instead of multiple choice. The details matter!”
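Wang's distinction between matching the distribution and matching the mean is easy to see with a toy example. The sketch below is illustrative only (not the study's actual methodology, and the survey numbers are made up): a polarized human sample and a "flattened" simulated one can share an identical mean while their shapes diverge sharply.

```python
# Illustrative sketch, not the study's method: two response
# distributions can share a mean while differing sharply in shape,
# which is why comparing only means can hide "flattening."
from statistics import mean

def ks_distance(a, b):
    """Max gap between the empirical CDFs of two samples."""
    points = sorted(set(a) | set(b))
    cdf = lambda xs, t: sum(x <= t for x in xs) / len(xs)
    return max(abs(cdf(a, p) - cdf(b, p)) for p in points)

# Hypothetical 1-5 survey ratings: humans are polarized, while the
# simulated responses cluster at the middle ("flattened").
human = [1] * 40 + [5] * 40 + [3] * 20
model = [3] * 80 + [2] * 10 + [4] * 10

assert mean(human) == mean(model) == 3.0  # identical means...
print(ks_distance(human, model))          # ...very different shapes
```

A mean-only comparison would call these two samples equivalent; the distributional gap of 0.4 says otherwise, which is the study's core complaint about LLM stand-ins.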

You wouldn’t trust an AI bot to attend a networking event or to close a critical deal, so stop relying on AI for the wrong parts of your sales process.
Harness the strengths of humans + AI with Bounti, your AI teammate: it does all the research and prep work for you, our humans make sure the content meets your needs, and you remain the expert closer. In minutes, Bounti gives you a toolkit with everything you need to win target accounts, so you can:
✔️ know your prospects and what they care about
✔️ land your pitch by connecting to buyer business objectives
✔️ and thoughtfully engage them with personalized outreach


OpenAI is considering granting special voting rights to its nonprofit board that would enable board directors to overrule major investors, protecting against potential hostile takeovers, according to the FT. The measure comes shortly after Elon Musk’s offer to purchase the nonprofit for nearly $100 billion.
A recent report from DiPLab found that the excitement over DeepSeek’s release of R1 ignores a vast ecosystem of underpaid, government-subsidized human data annotators who were essential to the model’s construction.

How the drone battles of Ukraine are shaping the future of war (New Scientist).
US judge extends order to block DOGE from Treasury data (Wired).
Gecko Robotics plans to double UAE footprint (Semafor).
Hollywood writers say AI is ripping off their work. They want studios to sue (LA Times).
AI 'hallucinations' in court papers spell trouble for lawyers (Reuters).
Musk releases Grok 3, a slightly better version of the competition

Source: xAI
The news: Late Monday night, Elon Musk’s xAI launched Grok 3, the third iteration of its Grok chatbot. Like the earlier Grok releases, Grok 3 refers to a family of large language models (LLMs). Unlike earlier versions, though, xAI believes this one takes the cake, calling it the “smartest AI in the world.”
The details: Grok 3 was trained at xAI’s Memphis data center, which is lined with around 200,000 GPUs; Musk has said it was developed with “10x” the compute of Grok 2.
“Grok 3 is an order of magnitude more capable than Grok 2,” he said during a live-streamed demo of the chatbot on Monday. “[It’s a] maximally truth-seeking AI, even if that truth is sometimes at odds with what is politically correct.”
And in keeping with increasingly popular industry trends, Grok 3 was built by applying reinforcement learning to a pre-trained model. xAI’s head of research, Jimmy Ba, said during the demo that “pre-training is not enough to build the best AI. The best AI needs to think like a human.”
As such, part of the Grok 3 model family includes a ‘reasoning’ model, remarkably similar to OpenAI’s o-series or DeepSeek’s R1, which uses Chain-of-Thought reasoning during inference to better answer queries. xAI also announced the launch of its first ‘agent,’ a research tool somewhat unsurprisingly called “Deep Search,” a clear play on OpenAI’s Deep Research.
And similar to OpenAI, Musk said xAI is doing “some obscuration of the thinking so our model doesn’t get instantly copied. There’s more to the thinking” than what is shown.
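The "obscuration" Musk describes is a serving pattern, not a model change: the raw chain of thought stays server-side and only a summary or the final answer reaches the client, so the reasoning traces can't be harvested for distillation. Here is a minimal sketch of that pattern; xAI's actual implementation is not public, and the `<think>` delimiter below is borrowed from the convention open reasoning models like DeepSeek-R1 use, purely for illustration.

```python
# Sketch of the "obscured thinking" serving pattern (assumed design,
# not xAI's actual code): strip the reasoning trace before replying.
import re

def serve_response(raw_output: str) -> dict:
    """Split a reasoning model's raw output into a hidden trace and
    the visible answer, returning only the answer to the client."""
    match = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL)
    hidden = match.group(1).strip() if match else ""
    visible = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    return {"answer": visible, "hidden_tokens": len(hidden.split())}

out = serve_response("<think>2+2 is 4, double it</think>The answer is 8.")
print(out["answer"])  # only the final answer leaves the server
```

The trade-off is transparency: users (and evaluators) see less of how the model reached its answer, which is exactly why copying it gets harder.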
So, we’ve got the standard Grok 3, Grok 3 mini, Grok 3 Advanced Reasoning and Grok 3 Deep Search, a lineup of products available through X or through a separate subscription on the Grok website or app. Several of these products are still in beta testing, though Musk said his team will be shipping improvements constantly, with a voice mode set to arrive in as little as a week.

Benchmarks: According to the team, Grok 3 beats out all the competition across a number of benchmarks, taking the top spot in the popular “chatbot arena” benchmark with a score of 1400, while also beating OpenAI, DeepSeek, Google and Anthropic on math, coding, science and reasoning benchmarks.
But none of this information has been peer-reviewed or independently verified, so on its own it doesn’t mean much. Even taking the benchmark data at face value, Grok outperforms the competition by only a slim margin, which is notable considering that xAI built Grok with “10x more training than current best models.” As software engineer Paul Klein wrote, “when everyone says they're (state-of-the-art) on an eval, you start to question the eval.”
Other things to note: Musk said that xAI will open-source Grok 2 in a few months, once Grok 3 is stable. He also said the company has started work on its next data center cluster, one that will have five times the power requirements of the current cluster (roughly 1.2 gigawatts).
Andrej Karpathy, formerly the director of AI at Tesla, spent some time putting Grok 3 through its paces, finding that the model is “somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at.”
Adding that the model is “incredible” given the brief amount of time xAI took to put it together, Karpathy noted that “the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks.”
It seems like a good model, but the most impressive thing is the speed at which it was built.


Everything seems to be equalizing.
I find it incredibly notable that, given the quantity of compute xAI is working with, Grok 3 performs roughly on par with (or maybe a little above) the state of the art, something that is inconsistent with the idea that scaling compute is all you need to build increasingly powerful models.
I also find it notable that the entire industry is seemingly stuck on an identical course — research ‘agents,’ Chain-of-Thought reasoning to boost test-time compute, reinforcement learning, massive models built on data scraped from the internet, etc. Right now, there is no significant difference between this and any other model.
xAI’s Deep Search, or OpenAI’s Deep Research, or Perplexity’s Deep Research … OpenAI’s o3, or DeepSeek’s R1, or Google’s Flash Thinking, or Anthropic’s Claude, or xAI’s Grok 3. What you have is a number of massively funded companies essentially reproducing the same product in different wrappers, something that at the very least indicates that there is absolutely no moat, a point investors don’t seem to be getting.
None of the major labs seem to be working on unique approaches or unique applications. It’s all chatbots and ‘agents,’ with the note that you should ignore hallucinations and use them anyway. We’re two years into this race, and we don’t have a killer app (chatbots aren’t it). We don’t have one thing everybody uses.
But we have companies increasingly locked in this benchmark and scale race, which means more data centers and less efficiency. That Memphis data center that Musk is so proud of has been actively contributing to the city’s air pollution problems for months — and clearly it was all worthwhile because Grok 3 scored a 1402 to Gemini’s 1385 on the Chatbot Arena …
As per usual, we don’t know the training data, we don’t know the details of model or system architecture, we don’t know the energy intensity and carbon emissions associated with both training and operating the model and we don’t have validation for its benchmark scores.
So, another day, another model.
The race goes on.


Which image is real?



🤔 Your thought process:
Selected Image 2 (Left):
“Reflection in water seems more accurate of real life.”
💭 A poll before you go
Thanks for reading today’s edition of The Deep View!
We’ll see you in the next one.
Here’s your view on the NYT’s AI:
30% of you think it’s pretty sensible, 25% think it’s ridiculous and 20% said you’ve read them before, but won’t anymore.
Ridiculous:
“It's ridiculous that they can admit the benefits and value of the tool but also sue the same companies for creating it. Their content was publicly available and not at their expense; they're just upset they didn't get a slice of a hyped pie.”
Something else:
“Curious to what extent it hurts their lawsuit position... we shall see.”
Do you like Grok 3?
If you want to get in front of an audience of 200,000+ developers, business leaders and tech enthusiasts, get in touch with us here.