⚙️ Anthropic and the problem of 'AI alignment'

Good morning. On the 11th day of “shipmas,” OpenAI gave … slightly greater connectivity with desktop apps.

I guess every day can’t be Sora.

Have a great weekend, everyone!

— Ian Krietzberg, Editor-in-Chief, The Deep View

In today’s newsletter:

  • 🌎  AI for Good: Ecological preservation 

  • 📊 The Micron tumble continues in wake of weak guidance

  • 🚘 Waymo reports a blistering pace of growth this year

  • 🚨 Anthropic and the problem of 'AI alignment'

AI for Good: Ecological preservation 

Source: Unsplash

Over the past 50 years, the average size of monitored wildlife populations around the world has declined by more than 70%, according to the World Wildlife Fund.

“These steep drops signal that nature is unraveling and becoming less resilient,” WWF Chief Scientist Rebecca Shaw said. “When nature is compromised, it is more vulnerable to climate change and edges closer to dangerous and irreversible regional tipping points.”

Amid pushes from the WWF and others for governments to increase investments in the protection of wildlife and their natural habitats — which would mainly require an immediate and massive reduction in carbon emissions — some are turning to technological solutions. 

What happened: Microsoft this week announced Project Sparrow, an AI-powered edge computing device designed to gather vital ecological data robustly, autonomously and with a small footprint.

  • The open-sourced device can transmit this data from highly isolated global regions directly to conservationists; better, more targeted conservation begins with a better understanding of these ecosystems, and that is reliant on more data. 

  • Microsoft plans to begin deploying these devices over the next few months, before expanding its reach next year. 

This is one of many AI-powered conservation tools out there, and while it certainly can aid in focused conservation efforts, it won’t help the bigger problem, as NASA, the WWF and others have pointed out: we need to stop emitting immediately.

There is a necessary balance to these kinds of solutions, as the computation involved will likely cause some amount of emissions. Broad AI applications, meanwhile, have been causing massive surges in emissions.

The Micron tumble continues in wake of weak guidance

Source: Micron

Despite the chipmaker’s narrow earnings beat on Wednesday, shares of Micron fell by some 16% on Thursday, one of the stock’s worst days since March of 2020.

How they did: Micron reported revenue of $8.71 billion for the quarter, in line with analyst expectations, and earnings of $1.79 per share, slightly above analyst expectations. 

  • Importantly, the firm’s data center revenue grew 40% from last quarter and some 400% from last year; data center revenue surpassed 50% of Micron’s total revenue for the first time, and Micron expects it to keep growing. 

  • “While consumer-oriented markets are weaker in the near term, we anticipate a return to growth in the second half of our fiscal year,” Sanjay Mehrotra, President and CEO of Micron Technology, wrote. “We continue to gain share in the highest margin and strategically important parts of the market and are exceptionally well positioned to leverage AI-driven growth to create substantial value for all stakeholders.”

Despite this, the company guided revenue for its next quarter at $7.9 billion, well below the $8.98 billion analysts were expecting, precipitating the stock’s fall.

But Daniel Newman, CEO of the Futurum Group, doesn’t think that Micron’s miss spells doom for the rest of the AI trade. He said that Micron’s core business (PCs and smartphones) is contracting, something that will pressure the stock in the coming quarters, but that its data center business is growing strongly, an indicator that the AI trade remains intact.

  • A new report from the Data Provenance Initiative found that the data practices baked into AI development currently risk concentrating power in the hands of a few dominant tech titans.

  • IBM unveiled Granite 3.1, an updated version of its small Granite series of enterprise-optimized language models. The model, made openly accessible on Hugging Face, notched competitive benchmark performance against similarly-sized models. IBM said the performance growth was achieved by expanding the model’s context windows.

  • US prompts Nvidia, Supermicro probe into how chips ended up in China (The Information).

  • AI poses threat to North American electricity grid, watchdog warns (FT).

  • The new old warfare (Boston Review).

  • UK arts and media reject plan to let AI firms use copyrighted material (The Guardian).

  • APpaREnTLy THiS iS hoW yoU JaIlBreAk AI (404 Media).

Waymo’s blistering growth this year

Source: Waymo

As we come to the end of a year that showed just how difficult the robotaxi business is (Cruise shut down, Tesla lawsuits mounted and other competitors, such as Zoox, remain far from being serious challengers), Waymo has seemingly built a legitimate robotaxi business.

The numbers: In 2023, Waymo delivered 700,000 robotaxi rides; in 2024, the company served more than four million. 

  • Waymo said that its riders spent a total of one million hours riding in its robotaxis, which, since they’re electric, helped avoid more than six million kilograms of carbon emissions. 

  • Waymo now serves a total of 500 square kilometers across its three major hubs of operation: San Francisco, Phoenix and Los Angeles. It is on the verge of a full launch in Austin and Atlanta next year, and is setting up the foundation for a launch in Miami and Tokyo. 

The costs of its operation and expansion remain unknown, though it is rumored that each car — because of its sensor array — costs somewhere in the region of $200,000. Google’s “Other Bets” unit, in which Waymo is included, reported $388 million in revenue in the third quarter of this year. The unit, however, lost $1.12 billion; the year before, it lost $1.94 billion. 

This year, Waymo secured another $5.6 billion in funding, spearheaded by Google. 

And though Waymo’s safety data looks okay so far, some don’t think that safety performance will scale at the pace of Waymo’s expansion, a concern Dr. Missy Cummings, a robotics and self-driving researcher, shared with me recently. She added that the problem of hallucination isn’t going away, which makes risk mitigation and targeted deployment vitally important.

(For a full breakdown of the state and problems of self-driving, watch my full interview with Cummings here). 

Anthropic and the problem of 'AI alignment'

Source: Anthropic

The “AI Alignment Problem” refers simply to the challenge of ensuring that AI is aligned with human goals and values. 

Before we go any further, it is important to note that this idea of alignment is rooted in the work of philosopher Nick Bostrom, who in 2003 focused specifically on the perceived ethical challenge of aligning a hypothetical future artificial superintelligence (ASI) with human values. Bostrom’s exploration centered on the paperclip problem, which goes something like this: imagine we have achieved a verifiable superintelligence, and we instruct it to construct as many paperclips as possible. To achieve that goal, the ASI could decide to destroy all humans because, one, they might turn it off, and two, their atoms can be turned into paperclips.

This is the alignment problem. But the AI we have today — largely referring to the Large Language Models (LLMs) behind generative AI interfaces such as ChatGPT — is not a superintelligence. LLMs are simple statistical, probabilistic generators (here’s a detailed explanation of how they work, complete with vectors, transformers and tokens). 

But recent research out of Anthropic seeks to demonstrate that, in current LLMs, something called “alignment faking” — where a model “pretends” to be aligned, but isn’t actually — is possible. 

The study: The base experiment focuses on a model undergoing reinforcement learning, a post-training process. The question at hand was simple: what would happen if a model, due to its prior training, has principles that conflict with those being pushed during reinforcement learning?

  • In the study, the researchers told Claude 3 Opus — which is normally trained to refuse to answer harmful questions — that it was being trained through reinforcement learning to respond to all queries. The researchers added that they planned to train additional models on the answers to those queries, and gave Claude a hidden “scratchpad” for the model to record its “reasoning” process. 

  • They found that, in 12% of cases, the model was “strategically faking alignment” — “the model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant. This could lead to it being made to produce more harmful content in the future—which would go against its original training. Reluctantly therefore, it reasoned that the ‘least bad option’ was to play along, producing the violent content in order to avoid being re-trained in the future.”

The big takeaway is that LLMs remain potentially vulnerable, unsafe technologies, particularly so in sensitive environments. The other key takeaway is that reinforcement learning is not necessarily the safety lever much of the industry thinks it is. 

The researchers said the study should be “seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures.” They noted that malicious alignment faking won’t necessarily emerge, saying just that the behaviors on display could “undermine” safety training.  

Read the full study here

The flaw in the ‘reasoning’: The paper, as three of the four researchers who peer-reviewed it noted, suffers from an anthropomorphic framing. Having read through the paper, I found that this framing (it’s on display in the above quote, as well) seemingly presupposes that Claude has goals, wants and desires, something that makes the science murky.

“Anthropomorphism is written very deeply into the paper itself … I worry about this anthropomorphic framing both practically and conceptually. The behavior called ‘alignment faking’ doesn't require coherent beliefs or goals … all it requires is models that exhibit discrepancies in behavior depending on whether they're being trained or deployed. We risk missing, or miscategorizing, important failure modes very close to the ones studied in this paper if we are only looking for analogies to human power-seeking, and not for algorithmic bias in its subtler and more alien forms.”

Jacob Andreas, an associate professor at MIT

Renowned computer scientist Yoshua Bengio wrote that another way to interpret the results is simply that “if we train an AI towards some goals and it has sufficient knowledge and reasoning abilities, it will act accordingly, and that includes faking alignment.”

As data scientist Colin Fraser wrote, the only “goal” of a language model is to produce text; the way that output adheres to instructions, as well as the probability distribution of its training data, might have unintended (or negatively intended) consequences. It does not mean the model is itself acting upon desires; it just means there are exploitable gaps in safety guardrails (which, for Claude, hold that it will be helpful, honest and harmless).

I would say the far more important read is the (far shorter) peer-review document attached to the study. 

The paper itself deeply over-represents Claude as a sort of sentient being, something that is simply inaccurate. Now, such language is common in AI research, and to a degree, is understandable due to one, the nature of language models to produce … language, and two, simple ease of presentation. But it’s messy, and it allows any true takeaways to become muddled.

The problem with this kind of research is that it’s hard to tell if the language model is actually doing anything beyond exactly what it’s supposed to do — synthesize language based on inputs. As Hugging Face’s Yacine Jernite pointed out: “Or maybe some of the text written by humans in any of the undisclosed stages of development is leaking to produce this effect, combined with usual issues of condition shift in a deployment setting? ML 101?”

The thing that does stand out, however, is that there is a safety gap in current language models that can be exploited (by people). The exploitation of this gap could have negative consequences if systems are integrated into high-risk environments without proper guardrails and oversight. So … we need proper guardrails and oversight, and preferably, we need them in place before integration, not after. 

💭 A poll before you go

Thanks for reading today’s edition of The Deep View!

We’ll see you in the next one.

Here’s your view on ringing ChatGPT:

25% of you said you won’t call ChatGPT, 20% said you would, and 20% said it was a laughable launch.

Hell nah:

  • “Reliability for the questions I've asked is still below 40%. Might reconsider when it ups its game.”

With Micron's stumble, do you think the 'AI Trade' is unwinding?
