The intelligence of AI models is not what is blocking enterprise deployments. It is the inability to define and measure quality in the first place.
This is where AI judges now play an increasingly important role. In AI evaluation, a "judge" is an AI system that evaluates the output of another AI system.
Judge Builder is Databricks’ framework for creating judges; it first shipped as part of the Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct feedback from users and deployments.
Early versions focused on technical implementation, but customer feedback revealed that the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three main challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from limited subject matter experts, and implementing evaluation systems at scale.
"Model intelligence is usually not the bottleneck, models are really smart," Jonathan Frankel, chief artificial intelligence scientist at Databricks, told VentureBeat in an exclusive briefing. "Instead, it’s really about how do we get the models to do what we want, and how do we know if they’ve done what we wanted?"
Judge Builder addresses what Pallavi Koppol, the Databricks research scientist who led its development, calls the "Ouroboros problem." The Ouroboros is an ancient symbol depicting a snake eating its own tail.
Using AI systems to evaluate AI systems creates a circular validation challenge.
"You want a judge to see if your system is good, if your AI system is good, but then your judge is also an AI system," Coppol explained. "And now you say, well, how do I know this judge is good?"
The solution is to measure "distance to human expert ground truth" as the primary scoring function. By minimizing the difference between how an AI judge scores outputs and how domain experts would score them, organizations can trust these judges as scalable surrogates for human evaluation.
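As a rough illustration of what a "distance to expert truth" score could look like, here is a minimal Python sketch (not Databricks' implementation) that compares a judge's ratings against expert consensus ratings using mean absolute difference. The 1-5 scale and all names and values are assumptions for the example.

```python
# Minimal sketch (not Databricks' implementation): score a candidate judge by its
# distance to expert ground truth on a shared, labeled example set.
# The 1-5 rating scale and all values below are assumptions for illustration.

def distance_to_expert_truth(judge_scores: list[float], expert_scores: list[float]) -> float:
    """Mean absolute difference between judge and expert ratings (lower is better)."""
    assert len(judge_scores) == len(expert_scores) and judge_scores
    return sum(abs(j - e) for j, e in zip(judge_scores, expert_scores)) / len(judge_scores)

experts = [5, 1, 4, 2, 5]   # consensus ratings from subject matter experts
judge_a = [5, 2, 4, 2, 4]   # candidate judge that tracks the experts closely
judge_b = [3, 3, 3, 3, 3]   # candidate judge that always hedges to the middle

print(distance_to_expert_truth(judge_a, experts))  # 0.4 -> closer to human truth
print(distance_to_expert_truth(judge_b, experts))  # 1.6 -> a worse surrogate for humans
```

A judge that minimizes this gap can stand in for the experts at scale; one that hedges toward the middle of the scale gets penalized even though it is never wildly wrong.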
This approach differs sharply from generic guardrails or single-metric evaluations. Rather than asking whether an AI output passed a general quality check, Judge Builder creates highly specific evaluation criteria tailored to each organization’s domain expertise and business requirements.
The technical implementation also sets it apart. Judge Builder integrates with Databricks’ MLflow and prompt-optimization tooling and can work with any base model. Teams can version their judges, track performance over time, and deploy multiple judges simultaneously across different quality dimensions.
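The article does not describe Judge Builder's internal APIs, but the versioning-and-tracking idea can be illustrated with general-purpose MLflow tracking calls. Everything below (the judge name, prompt, and metric values) is hypothetical, and nothing here is a Judge Builder-specific API.

```python
# Illustrative sketch only: judges are versioned and their performance tracked over time.
# Generic MLflow tracking calls can record that; names and numbers are hypothetical.
import mlflow

FACTUALITY_PROMPT_V2 = (
    "Rate the response 1-5 for factual accuracy against the retrieved context."
)

with mlflow.start_run(run_name="factuality-judge-v2"):
    mlflow.log_param("judge_name", "factuality")
    mlflow.log_param("prompt_version", 2)
    mlflow.log_param("base_model", "any-base-model")      # the framework is model-agnostic
    mlflow.log_text(FACTUALITY_PROMPT_V2, "judge_prompt.txt")
    mlflow.log_metric("distance_to_expert_truth", 0.4)    # hypothetical agreement numbers
    mlflow.log_metric("inter_rater_reliability", 0.6)
```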
Databricks’ work with enterprise clients has revealed three critical lessons that apply to anyone building AI judges.
Lesson one: Your experts don’t agree as much as you think. When quality is subjective, organizations find that even their own subject matter experts disagree on what constitutes an acceptable result. A customer service response may be factually correct but strike the wrong tone. A financial summary may be comprehensive but too technical for its target audience.
"One of the biggest lessons from this whole process is that all problems become people’s problems," Frankel said. "The hardest part is taking an idea out of a person’s brain and turning it into something clear. And what’s more difficult is that companies are not one brain, but many brains."
The fix is batched annotation with inter-rater reliability checks. Teams annotate examples in small batches, then measure agreement before moving on. This catches discrepancies early. In one case, three experts rated the same output 1, 5, and neutral before discussion revealed that they had interpreted the rating criteria differently.
Companies using this approach achieve inter-rater reliability scores of up to 0.6, compared to typical scores of 0.3 from external annotation services. Higher agreement translates directly into better judge performance because the training data contains less noise.
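The article does not say which reliability statistic Databricks uses; Cohen's kappa is one common choice for this kind of check. The sketch below, with made-up pass/fail labels from two hypothetical experts, shows how a batch-level agreement score can flag criteria that need discussion before annotation continues.

```python
# Sketch of an inter-rater reliability check between two expert annotators.
# Cohen's kappa is one common statistic; the labels below are illustrative only.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two experts label the same batch of 10 outputs before moving on
expert_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
expert_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "fail"]
print(round(cohens_kappa(expert_1, expert_2), 2))  # low values flag criteria to discuss
```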
Lesson two: Break fuzzy criteria into specific judges. Instead of one judge deciding whether a response is "relevant, factual, and concise," create three separate judges, each targeting a specific aspect of quality. This granularity matters because a failing "overall quality" score reveals that something is wrong, but not what to fix.
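A minimal sketch of that decomposition might look like the following. `call_llm`, the judge names, and the rubrics are hypothetical stand-ins, not Judge Builder code.

```python
# Sketch of decomposing one fuzzy "overall quality" check into three narrow judges.
# `call_llm` is a hypothetical stand-in for whatever base model a team uses; each judge
# gets its own targeted rubric so a failure points at what to fix.
JUDGES = {
    "relevance":   "Does the response answer the user's question? Reply PASS or FAIL.",
    "factuality":  "Is every claim supported by the retrieved context? Reply PASS or FAIL.",
    "conciseness": "Is the response free of unnecessary detail? Reply PASS or FAIL.",
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your model endpoint")  # hypothetical stub

def run_judges(question: str, context: str, response: str) -> dict[str, str]:
    verdicts = {}
    for name, rubric in JUDGES.items():
        prompt = (f"{rubric}\n\nQuestion: {question}\n"
                  f"Retrieved context: {context}\nResponse: {response}")
        verdicts[name] = call_llm(prompt).strip().upper()
    return verdicts  # e.g. {"relevance": "PASS", "factuality": "FAIL", "conciseness": "PASS"}
```

A failing "factuality" verdict is immediately actionable; a failing "overall quality" score is not.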
The best results come from combining top-down requirements, such as regulatory constraints and stakeholder priorities, with bottom-up discovery of observed failure patterns. One client built a top-down correctness judge, but found through data analysis that correct answers almost always cited the first two retrieval results. That insight became a new production-friendly judge that could signal correctness without requiring ground-truth labels.
Lesson three: You need fewer examples than you think. Teams can create robust judges from just 20-30 well-chosen examples. The key is to choose edge cases that expose disagreement rather than obvious examples where everyone agrees.
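One simple way to find such examples, sketched below under the assumption that candidate outputs already carry a handful of expert ratings, is to rank candidates by rating variance and keep the most contested ones.

```python
# Sketch: choose calibration examples where experts disagree most, rather than obvious
# cases everyone already agrees on. Assumes each candidate carries a few expert ratings.
from statistics import pvariance

def pick_calibration_set(candidates: list[dict], k: int = 25) -> list[dict]:
    """candidates look like {"example_id": "...", "expert_ratings": [1, 5, 3]} (hypothetical)."""
    ranked = sorted(candidates, key=lambda c: pvariance(c["expert_ratings"]), reverse=True)
    return ranked[:k]  # the k most contested examples go into the workshop discussion
```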
"We are able to complete this process with some teams in as little as three hours, so it doesn’t really take that long to start getting a good judge," Coppol said.
Frankle shared three metrics that Databricks uses to measure Judge Builder’s success: whether customers want to use it again, whether they’re increasing their AI spend, and whether they’re progressing on their AI journey.
On the first metric, one client created more than a dozen judges after their initial workshop. "This client made over a dozen judges after we walked them through it rigorously the first time with this framework," Frankle said. "They really went to town on judges and now they measure everything."
On the second metric, the business impact is clear. "There are multiple customers who have gone through this workshop and become seven-figure GenAI spenders at Databricks in a way they weren’t before," Frankle said.
The third metric reveals Judge Builder’s strategic value. Clients who were previously hesitant to use advanced techniques like reinforcement learning now feel confident implementing them because they can measure whether improvements have actually occurred.
"There are clients who have gone and done very advanced things after having these judges where they were reluctant to do it before," Frankel said. "They moved from doing some rapid engineering to doing reinforcement learning with us. Why spend the money on reinforcement training and why spend the energy on reinforcement training if you don’t know if it really matters?"
Teams that successfully move AI from pilot to production treat judges not as disposable artifacts, but as evolving assets that grow with their systems.
Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement plus one observed failure mode. These become your initial judge portfolio.
Second, create lightweight workflows with subject matter experts. A few hours spent reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batched annotation and inter-rater reliability checks to de-noise your data.
Third, schedule regular judge reviews using production data. As your system evolves, new failure modes will appear; your judge portfolio should grow with them.
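A recurring review could be as simple as re-scoring a sample of production traces with each judge and comparing against fresh expert labels, as in this hypothetical sketch. The threshold, function names, and data shapes are assumptions, not a documented workflow.

```python
# Hypothetical sketch of a recurring judge review against production data: re-score a
# sample of traces with each judge, compare to fresh expert labels, and flag drift.
DRIFT_THRESHOLD = 0.75  # assumed acceptable gap on a 1-5 scale; tune per judge

def review_judges(judge_portfolio: dict, sampled_traces: list, expert_labels: dict) -> list[str]:
    """Return names of judges whose agreement with expert labels has degraded."""
    flagged = []
    for name, judge_fn in judge_portfolio.items():
        judge_scores = [judge_fn(trace) for trace in sampled_traces]
        truth = expert_labels[name]  # fresh expert ratings for the same traces
        gap = sum(abs(j - e) for j, e in zip(judge_scores, truth)) / len(truth)
        if gap > DRIFT_THRESHOLD:
            flagged.append(name)  # schedule a recalibration session for this judge
    return flagged
```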
"A judge is a way to evaluate a model, it’s also a way to create railings, it’s also a way to have a metric against which you can do fast optimization, and it’s also a way to have a metric against which you can do reinforcement training," Frankel said. "Once you have a judge that you know represents your human taste in an empirical form that you can ask as many questions as you like, you can use it in 10,000 different ways to measure or improve your agents."