The 70% fact ceiling: why Google’s new FACTS benchmark is a wake-up call for enterprise AI

There is no shortage of generative AI benchmarks designed to measure a model's performance and accuracy on a variety of useful enterprise tasks, from coding to instruction following to agent-style web browsing and tool use. But many of these benchmarks share one major flaw: they measure AI's ability to handle specific problems and requests, not how factual the model's output is, meaning how well it generates objectively correct information about real-world data, especially when that information is contained in images or graphics.

For industries where accuracy is paramount – legal, financial and medical – the lack of a standardized way to measure factuality is a critical blind spot.

That changes today: Google’s FACTS team and its Kaggle data unit have released the FACTS Benchmark Suite, a comprehensive benchmarking framework designed to fill this gap.

A related research paper offers a more nuanced definition of the problem, separating "factuality" into two distinct operational scenarios: "contextual factuality" (grounding answers in the data provided) and "world-knowledge factuality" (retrieving information from memory or the web).

While the headline news is that Gemini 3 Pro takes the top spot, the deeper story for builders is an industry-wide "fact ceiling."

According to the initial results, no model – including Gemini 3 Pro, GPT-5 or Claude 4.5 Opus – was able to exceed 70% accuracy on the problem set. For tech leaders, that is a signal: the age of "trust but verify" is far from over.

Deconstructing the benchmark

The FACTS suite goes beyond simple question answering. It consists of four different tests, each simulating a distinct real-world failure mode that developers encounter in production:

  1. Parametric Benchmark (Internal Knowledge): Can the model accurately answer trivia-style questions using only its training data?

  2. Search Benchmark (Tool Use): Can the model effectively use a web search tool to retrieve and synthesize live information?

  3. Multimodal Benchmark (Vision): Can the model accurately interpret graphs, charts, and images without hallucinating?

  4. Grounding Benchmark v2 (Context): Can the model adhere strictly to the source text provided?

Google has released 3,513 examples to the public, while Kaggle holds back a private set to prevent developers from training on the test data, a common problem known as "contamination."

The Rankings: A Game of Inches

The initial benchmark run puts Gemini 3 Pro in the lead with a comprehensive FACTS score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI’s GPT-5 (61.8%). However, a closer look at the data reveals where the real battlegrounds are for engineering teams.

| Model | FACTS Score (Avg) | Search (RAG capability) | Multimodal (Vision) |
| --- | --- | --- | --- |
| Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
| Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
| GPT-5 | 61.8 | 77.7 | 44.1 |
| Grok 4 | 53.6 | 75.3 | 25.7 |
| Claude 4.5 Opus | 51.3 | 73.2 | 39.2 |

Data sourced from the FACTS team release notes.

For Builders: The "Search" vs "Parametric" gap

For developers building RAG (Retrieval-Augmented Generation) systems, the Search Benchmark is the most critical metric.

The data shows a clear gap between a model's ability to "know" things (Parametric) and its ability to "find" things (Search). For example, Gemini 3 Pro scores a high 83.8% on Search tasks, but only 76.4% on Parametric tasks.

This confirms the current enterprise architecture standard: don’t rely on the model’s internal memory for critical facts.

If you’re building an in-house knowledge bot, the FACTS results suggest that connecting your model to a lookup tool or vector database is not optional—it’s the only way to increase accuracy to acceptable production levels.
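To make that architecture concrete, here is a minimal sketch of the grounding pattern in Python. Note that `search_index` and `call_model` are hypothetical stand-ins for whatever vector store and model client your stack uses; they are not part of the FACTS release.

```python
# Minimal sketch of grounding a model in retrieved context rather than
# relying on its parametric memory. `search_index` and `call_model` are
# hypothetical placeholders for your own retrieval layer and model client.

def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Assemble a prompt that tells the model to answer only from the
    supplied context, and to say so when the context is insufficient."""
    context = "\n\n".join(f"[Doc {i + 1}] {doc}" for i, doc in enumerate(documents))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, reply 'Not found in context.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer_with_retrieval(question: str, search_index, call_model, top_k: int = 5) -> str:
    # 1. Pull the most relevant passages from your own knowledge base.
    documents = search_index.search(question, top_k=top_k)
    # 2. Ask the model to synthesize an answer grounded in those passages.
    return call_model(build_grounded_prompt(question, documents))
```

The design choice mirrors what the Search-versus-Parametric gap implies: the model is treated as a synthesizer of retrieved evidence, not as the source of truth.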

The multimodal alert

The most concerning data point for product managers is performance on multimodal tasks. Scores here are universally low. Even the category leader, Gemini 2.5 Pro, reached only 46.9% accuracy.

Benchmark tasks include reading charts, interpreting diagrams, and identifying objects in natural images. With accuracy below 50% across the board, this suggests that multimodal AI is not yet ready for unsupervised data extraction.

Bottom line: If your product roadmap includes AI automatically extracting data from invoices or interpreting financial charts without a human review in the loop, you are probably introducing a significant error rate into your pipeline.
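One common mitigation is a confidence gate that keeps a human in the loop for anything the vision model is unsure about. The sketch below illustrates the idea; the field names, the 0.95 threshold, and the `review_queue` are illustrative assumptions, not anything specified by the benchmark.

```python
# Illustrative human-in-the-loop gate for multimodal extraction.
# Field names, the 0.95 threshold, and `review_queue` are assumptions.

from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str          # e.g. "invoice_total"
    value: str         # value the vision model read from the document
    confidence: float  # model- or heuristic-derived confidence in [0, 1]

def route_extraction(fields: list[ExtractedField],
                     review_queue: list,
                     threshold: float = 0.95) -> dict:
    """Auto-accept only high-confidence fields; send the rest to human review."""
    accepted = {}
    for field in fields:
        if field.confidence >= threshold:
            accepted[field.name] = field.value
        else:
            # A human verifies the value before it enters the pipeline.
            review_queue.append(field)
    return accepted
```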

Why this matters to your stack

The FACTS benchmark is likely to become a standard reference point for model procurement. When evaluating models for enterprise use, technical leaders should look beyond the composite score and dive into the specific sub-benchmark that fits their use case (a brief illustration follows the list):

  • Building a customer support bot? Look at the Grounding score to ensure the bot sticks to your policy documents. (Gemini 2.5 Pro actually beats Gemini 3 Pro here, 74.2 vs. 69.0.)

  • Building a research assistant? Prioritize the Search score.

  • Building an image analysis tool? Proceed with extreme caution.
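As a quick illustration of that selection logic, the sketch below maps each use case to the relevant sub-benchmark and picks the highest-scoring model from the numbers reported above. The task-to-metric mapping is an assumption for illustration, and grounding scores are included only where this article cites them.

```python
# Worked example: choose a model by the sub-benchmark that matches the use case.
# Scores come from the results quoted above; the task-to-metric mapping is an
# illustrative assumption.

SCORES = {
    "Gemini 3 Pro":   {"search": 83.8, "multimodal": 46.1, "grounding": 69.0},
    "Gemini 2.5 Pro": {"search": 63.9, "multimodal": 46.9, "grounding": 74.2},
    "GPT-5":          {"search": 77.7, "multimodal": 44.1},
}

TASK_TO_METRIC = {
    "support_bot": "grounding",        # must stick to policy documents
    "research_assistant": "search",    # lives or dies on retrieval and synthesis
    "image_analysis": "multimodal",    # treat any result with caution
}

def best_model_for(task: str) -> tuple[str, float]:
    """Return the highest-scoring model on the metric relevant to the task."""
    metric = TASK_TO_METRIC[task]
    candidates = {m: s[metric] for m, s in SCORES.items() if metric in s}
    model = max(candidates, key=candidates.get)
    return model, candidates[model]

print(best_model_for("support_bot"))         # ('Gemini 2.5 Pro', 74.2)
print(best_model_for("research_assistant"))  # ('Gemini 3 Pro', 83.8)
```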

As the FACTS team noted in their announcement, "All evaluated models achieved an overall accuracy below 70%, leaving considerable room for future progress." For now, the message to the industry is clear: models are getting smarter, but they’re still not infallible. Design your systems with the assumption that roughly one-third of the time the raw model may simply be wrong.
