⚙️ Interview: A new approach to AI evaluation

Good morning. I sat down with self-driving expert Dr. Missy Cummings to chat all about the reality behind self-driving cars.

It’s a fascinating episode, if I do say so myself. Check it out!

— Ian Krietzberg, Editor-in-Chief, The Deep View

In today’s newsletter:

  • 🩻 AI for Good: Broken bones 

  • 🚘 Waymo goes international 

  • 💻 Report: AI Search gains aren’t enough to displace Google

  • 👁️‍🗨️ Interview: A new approach to AI evaluation

AI for Good: Broken bones 

Source: Unsplash

Britain’s National Institute for Health and Care Excellence (NICE) in October approved four AI tools to aid clinicians in the detection of broken bones on X-rays.

The details: The recommendation allows TechCare Alert, BoneView, RBfracture or Rayvolve to be used in urgent care settings in the U.K. while evidence surrounding their performance is gathered in this real-world setting. 

  • Missed fractures reportedly occur in 3% to 10% of cases; the evidence so far suggests, according to NICE, that the AI platforms approved above may improve fracture detection compared with a clinician reviewing an X-ray on their own. 

  • The idea is that the systems employ AI to recognize and flag any potential anomalies, which are then reviewed by professionals; because this isn’t in any way replacing the clinicians, NICE said it’s a relatively low-risk application of AI. 

Why it matters: “Using AI technology to help highly skilled professionals in urgent care centers to identify which of their patients has a fracture could potentially speed up diagnosis and reduce follow-up appointments needed because of a fracture missed during an initial assessment,” Mark Chapman, director of HealthTech at NICE, said. 

Waymo goes international 

Source: Waymo

In the latest example of Waymo’s seemingly ceaseless expansion, the self-driving firm on Monday said that it would soon start testing its autonomous vehicles in Tokyo, its first international expansion. 

The details: The first Waymos will arrive in Tokyo early next year; the first stage of their deployment will involve the manual mapping of key areas around the city, all done in partnership with drivers from local taxi company Nihon Kotsu. 

  • The data gathered from this manual mapping process will be used to train the AI systems that operate the vehicles. 

  • It’s not clear yet when Waymo will be fully open for service in Tokyo, or how much of the city will be accessible to the self-driving vehicles. The company told CNBC that this initial testing phase is expected to take several quarters.

The landscape: Japan, according to the World Economic Forum, is actively exploring safer driving solutions for its aging population. As part of this, it has been testing self-driving ventures. 

Several local companies — Tier IV, ZMP and Monet Technologies — are building and actively testing self-driving cars. 

In the U.S., however, Waymo doesn’t really have much competition, especially given Cruise’s recent shutdown. As other firms have lagged behind or fallen off, Waymo has spent 2024 steadily expanding its areas of operation, recently announcing that it will soon begin testing in Miami, a significant step forward given the rainy weather conditions of the East Coast. 

  • I sat down with Dr. Missy Cummings, the director of George Mason University’s Autonomy and Robotics Center, to talk about self-driving cars. She breaks down how they work, what their limitations are, and what a more realistic, grounded future of self-driving cars might look like.

  • You can watch (or listen to) the episode here. 

  • China poised to investigate more US tech deals after Nvidia probe (The Information).

  • US finalizes $406 million chips subsidy for Taiwan's GlobalWafers (Reuters).

  • AI startup Databricks hits $62 billion valuation in $10 billion funding round (WSJ).

  • Dexcom’s over-the-counter glucose monitor now offers users an AI summary of how sleep, meals and more impact sugar levels (CNBC).

  • Canada is entering into uncharted political territory (Semafor).

Report: AI Search gains aren’t enough to displace Google

Source: Unsplash

AI Search platforms — led by Perplexity and OpenAI — have become more popular of late. A new report from SEO firm BrightEdge found that the AI search entrants are “gaining ground,” but displacing Google is likely not in the cards. 

The details: The report found that, in November, OpenAI’s search engine experienced a 44% month-over-month growth in referrals; Perplexity experienced a 71% growth. 

  • BrightEdge found that OpenAI search — which launched as SearchGPT in August — now has six times more search usage than Perplexity in terms of referral clicks.

  • “This rapid ascent puts ChatGPT on a trajectory to potentially capture a 1% market share in 2025,” according to BrightEdge, something that could translate to $1.2 billion(+) in revenue. 

But, Google: At the same time, Google has been expanding its AI Overviews, which, according to the company, are now reaching more than a billion users each day. Google’s AI Overviews, according to BrightEdge, have become far more stable than they were at launch, and Google’s search market share remains at 92.4%, “meaning that new entrants will need to coexist with and differentiate themselves from Google, rather than aiming to overtake the search giant.”

"This is a moment of inevitability in search; we've long anticipated the rise of AI, and now it's reshaping the search landscape before our eyes," Jim Yu, CEO and co-founder of BrightEdge, said in a statement. "The data clearly shows that the stakes have never been higher. Newer entrants like ChatGPT search and Perplexity are gaining ground, while Google’s AI Overviews are getting smarter.”

It remains unclear just how much the introduction of AI to search increases the regular energy consumption and carbon emissions of search. 

Interview: A new approach to AI evaluation 

Source: Unsplash

Note: the following was corrected on 12-19-2024 to address a typo

The past couple of years have seen a massive expansion of AI accessibility, Douwe Kiela, CEO and co-founder of enterprise AI startup Contextual AI, told me. Large Language Models (LLMs) and generative AI have never been more accessible than they are today; people no longer need to train models — or even understand a lick of code — in order to deploy models, thanks to APIs.  

The problem, Kiela said, is that, even as models have become more accessible, methods of evaluating those models have not. Evaluation still requires a deep expertise in data science and machine learning at a pretty granular level, according to Kiela, who said that it remains a relatively involved “manual process” that many people just don’t know how to do. 

  • “This would be fine if AI wasn't really used anywhere,” he said, chuckling. “But AI is used everywhere now. So it's becoming a huge problem, especially if you're in a regulated industry, or something like that, you have to actually think very deeply about what you're doing there … the tools don't really exist for people to do that properly.” 

  • The ideal scenario, according to Kiela — who was one of the co-authors of the original RAG research paper — would be to make LLM evaluation accessible to developers “in the same way that you make language model APIs accessible.” 

Contextual AI on Tuesday introduced LMUnit, a system designed to do exactly that. 

The details: According to Contextual, LMUnit enables developers to define and evaluate natural language unit tests to get detailed, “fine-grained” understandings of model performance, something that allows for “precise diagnosis” of potential problems. 

  • Current methods for LLM evaluation, which involves scoring response content and quality, include human annotation, automatic metrics and language model judging, where a separate model evaluates a model’s performance. The problem with these methods, according to Contextual, is that they’re either too expensive, require too much expertise or are simply not fine-grained enough to be of value.

  • Similar to unit testing for traditional software, LMUnit powers unit testing that evaluates “discrete qualities of individual model outputs — from basic accuracy and formatting to complex reasoning and domain-specific requirements. This enables developers to evaluate LLM responses granularly to learn specific signals for improvement.”

The tests can be constructed — in natural language — manually or synthetically, and score each response with a “pass” or “fail.” 
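
To make the idea concrete, here is a minimal sketch of what natural language unit testing could look like in code. It is not Contextual's API: `UnitTest`, `run_unit_tests` and `toy_judge` are hypothetical names invented for illustration, and the keyword-based judge is only a stand-in for the evaluation model that would actually score each response.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: a judge takes (response, natural-language test)
# and returns True for "pass". A real system would call an evaluation model.
Judge = Callable[[str, str], bool]


@dataclass(frozen=True)
class UnitTest:
    name: str
    criterion: str  # the test itself, written in plain language


def run_unit_tests(response: str, tests: List[UnitTest], judge: Judge) -> Dict[str, str]:
    """Score one model response against each test, one verdict per test."""
    return {t.name: "pass" if judge(response, t.criterion) else "fail" for t in tests}


def toy_judge(response: str, criterion: str) -> bool:
    """Stand-in judge (an assumption): hard-coded rules instead of a model."""
    rules = {
        "Does the response cite a source?": lambda r: "according to" in r.lower(),
        "Is the response under 50 words?": lambda r: len(r.split()) < 50,
    }
    if criterion not in rules:
        raise ValueError(f"toy judge has no rule for: {criterion!r}")
    return rules[criterion](response)


tests = [
    UnitTest("cites_source", "Does the response cite a source?"),
    UnitTest("concise", "Is the response under 50 words?"),
]

report = run_unit_tests("Fractures are sometimes missed.", tests, toy_judge)
print(report)  # {'cites_source': 'fail', 'concise': 'pass'}
```

The point of the per-test report is the "fine-grained" signal described above: rather than one overall score, a developer sees which specific quality failed.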

“It just needs to be its own category of models,” Kiela said. “Just like we have an embedding model, which is different from a language model, because we need to take those embeddings and put them in our vector database … these are just separate models with different kinds of things they do. And so evaluation very clearly needs to be its own category.”

“It's not about models. It's about systems, and the entire system is what solves your problem,” he added, saying that the language model component often only makes up around “20% of that system.” 

LMUnit is now available to the public through Contextual’s API, as well as to Contextual’s customers. 

Contextual closed an $80 million funding round in August.

This comes amid both a broad push toward AI “agents” — which Kiela referred to as systems but with more hype — and a steady increase in enterprise AI adoption. As corporations commit to spending more and more money on AI products, many have become heavily focused on ensuring that they are deriving clear returns from their costly investments; unsolved reliability issues at the model level have paved the way for broader systems that allow companies to overcome those problems, which could enable broader deployment. 

Which image is real?

🤔 Your thought process:

Selected Image 2 (Left):

  • “This must be coastal week at The Deep View! Someone wishing they could travel over the holidays?”

Guilty. This week’s theme was: ‘places in Greece I’d rather be right now.’

Selected Image 1 (Right):

  • “I just gave it a quick glance today. Not feeling too well. I thought the bright white and the real one was just too bright.”

💭 A poll before you go

Thanks for reading today’s edition of The Deep View!

We’ll see you in the next one.

Here’s your view on smart glasses:

Only 12% of you currently use smart glasses.

47% of you don’t use them today, but expect you will soon; 30%, meanwhile, don’t ever plan on putting on a pair of smart glasses.

I’ll mess around with them, but I don’t love the idea of putting technology on my face. I’ll keep it in my hands, thanks.
