    Databricks research reveals that building better AI judges isn't just a technical concern, it's a people problem

    November 5, 2025



    The intelligence of AI models isn't what's blocking enterprise deployments. It's the inability to define and measure quality in the first place.

    That's where AI judges are now playing an increasingly important role. In AI evaluation, a "judge" is an AI system that scores outputs from another AI system. 
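    To make the pattern concrete, here is a minimal sketch of an LLM-as-judge loop in Python: a judge model receives a rubric and the output under review and returns a numeric score. The prompt wording, the 1-5 scale, and the pluggable call_model function are illustrative assumptions; the article does not describe Judge Builder's internal prompt format.

```python
# Minimal LLM-as-judge sketch: one AI system scoring another system's output
# against an explicit rubric. Prompt text, scale, and call_model are assumptions.
from typing import Callable

JUDGE_PROMPT = """You are evaluating a customer-support answer.
Rubric: {rubric}

Question: {question}
Answer under review: {answer}

Reply with a single integer from 1 (poor) to 5 (excellent)."""


def judge_output(question: str, answer: str, rubric: str,
                 call_model: Callable[[str], str]) -> int:
    """Ask a judge model to score one output and parse the integer it returns."""
    prompt = JUDGE_PROMPT.format(rubric=rubric, question=question, answer=answer)
    reply = call_model(prompt)
    digits = [c for c in reply if c.isdigit()]
    if not digits:
        raise ValueError(f"Judge returned no score: {reply!r}")
    return int(digits[0])


# Usage with a stub model; in practice call_model would wrap an LLM endpoint.
score = judge_output(
    question="How do I reset my password?",
    answer="Click 'Forgot password' on the login page.",
    rubric="The answer must be factually correct and use a helpful tone.",
    call_model=lambda prompt: "4",
)
print(score)  # 4
```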

    Judge Builder is Databricks' framework for creating judges; it was first deployed as part of the company's Agent Bricks technology earlier this year and has evolved significantly since launch in response to direct user feedback and customer deployments.

    Early versions focused on technical implementation, but customer feedback revealed that the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from a limited pool of subject matter experts, and deploying evaluation systems at scale.

    "The intelligence of the model is typically not the bottleneck, the models are really smart," Jonathan Frankle, Databricks' chief AI scientist, told VentureBeat in an exclusive briefing. "Instead, it's really about asking, how do we get the models to do what we want, and how do we know if they did what we wanted?"

    The 'Ouroboros problem' of AI evaluation

    Judge Builder addresses what Pallavi Koppol, a Databricks research scientist who led the development, calls the "Ouroboros problem." An ouroboros is an ancient symbol depicting a snake eating its own tail; using AI systems to evaluate AI systems creates the same kind of circular validation challenge.

    "You want a judge to see if your system is good, if your AI system is good, but then your judge is also an AI system," Koppol explained. "And now you're saying like, well, how do I know this judge is good?"

    The solution is measuring "distance to human expert ground truth" as the primary scoring function. By minimizing the gap between how an AI judge scores outputs versus how domain experts would score them, organizations can trust these judges as scalable proxies for human evaluation.
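    A rough sketch of what "distance to human expert ground truth" could look like in practice: score the judge by how closely its ratings track expert ratings on a shared calibration set. The two metrics below, mean absolute gap and exact agreement, are assumptions for illustration; the article does not publish Databricks' exact scoring function.

```python
# Sketch: measure a judge by its distance to expert labels on the same items.
def judge_vs_experts(judge_scores: list[int], expert_scores: list[int]) -> dict:
    assert len(judge_scores) == len(expert_scores)
    n = len(judge_scores)
    abs_gaps = [abs(j - e) for j, e in zip(judge_scores, expert_scores)]
    return {
        "mean_abs_distance": sum(abs_gaps) / n,                 # lower is better
        "exact_agreement": sum(g == 0 for g in abs_gaps) / n,   # higher is better
    }


print(judge_vs_experts(judge_scores=[5, 4, 2, 5], expert_scores=[5, 3, 2, 4]))
# {'mean_abs_distance': 0.5, 'exact_agreement': 0.5}
```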

    This approach differs fundamentally from traditional guardrail systems or single-metric evaluations. Rather than asking whether an AI output passed or failed on a generic quality check, Judge Builder creates highly specific evaluation criteria tailored to each organization's domain expertise and business requirements.

    The technical implementation also sets it apart. Judge Builder integrates with Databricks' MLflow and prompt optimization tools and can work with any underlying model. Teams can version control their judges, track performance over time and deploy multiple judges simultaneously across different quality dimensions.
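    As a hypothetical illustration of that workflow, the sketch below logs a judge version, its rubric, and its alignment metrics to MLflow so a team can compare judge versions over time. The experiment name, tags, and metric names are invented for this example and are not Judge Builder's actual schema.

```python
# Sketch: version and track judges with MLflow runs. Names/metrics are assumptions.
import mlflow


def log_judge_version(judge_name: str, version: int, rubric: str,
                      mean_abs_distance: float, exact_agreement: float) -> None:
    mlflow.set_experiment("ai-judges")  # group all judge runs in one experiment
    with mlflow.start_run(run_name=f"{judge_name}-v{version}"):
        mlflow.set_tag("judge_name", judge_name)
        mlflow.log_param("version", version)
        mlflow.log_param("rubric", rubric)
        # Alignment against the expert-labeled calibration set.
        mlflow.log_metric("mean_abs_distance", mean_abs_distance)
        mlflow.log_metric("exact_agreement", exact_agreement)


log_judge_version("tone-judge", version=2,
                  rubric="Response uses a professional, empathetic tone.",
                  mean_abs_distance=0.4, exact_agreement=0.7)
```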

    Lessons learned: Building judges that actually work

    Databricks' work with enterprise customers revealed three critical lessons that apply to anyone building AI judges.

    Lesson one: Your experts don't agree as much as you think. When quality is subjective, organizations discover that even their own subject matter experts disagree on what constitutes acceptable output. A customer service response might be factually correct but use an inappropriate tone. A financial summary might be comprehensive but too technical for the intended audience.

    "One of the biggest lessons of this whole process is that all problems become people problems," Frankle said. "The hardest part is getting an idea out of a person's brain and into something explicit. And the harder part is that companies are not one brain, but many brains."

    The fix is batched annotation with inter-rater reliability checks. Teams annotate examples in small groups, then measure agreement scores before proceeding. This catches misalignment early. In one case, three experts gave ratings of 1, 5 and neutral for the same output before discussion revealed they were interpreting the evaluation criteria differently.

    Companies using this approach achieve inter-rater reliability scores as high as 0.6, compared with typical scores of 0.3 from external annotation services. Higher agreement translates directly into better judge performance because the training data contains less noise.
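    Below is a sketch of the batched-annotation check, assuming average pairwise Cohen's kappa as the inter-rater reliability statistic (the article does not say which statistic Databricks uses): compute agreement across every pair of annotators on a batch and flag the batch for discussion when agreement is low.

```python
# Sketch: inter-rater reliability on a batch of annotations before training a judge.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score


def mean_pairwise_kappa(labels_by_annotator: dict[str, list[int]]) -> float:
    """Average Cohen's kappa over every pair of annotators on the same batch."""
    pairs = list(combinations(labels_by_annotator, 2))
    kappas = [cohen_kappa_score(labels_by_annotator[a], labels_by_annotator[b])
              for a, b in pairs]
    return sum(kappas) / len(kappas)


batch = {  # ratings from three experts on the same five outputs
    "expert_a": [5, 4, 2, 5, 1],
    "expert_b": [5, 3, 2, 4, 1],
    "expert_c": [4, 4, 2, 5, 1],
}
agreement = mean_pairwise_kappa(batch)
print(round(agreement, 2))  # flag the batch for discussion if agreement is low
```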

    Lesson two: Break down vague criteria into specific judges. Instead of one judge evaluating whether a response is "relevant, factual and concise," create three separate judges. Each targets a specific quality aspect. This granularity matters because a failing "overall quality" score reveals something is wrong but not what to fix.
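    One hypothetical way to structure that split in code: a small per-criterion judge for each quality dimension, each with its own narrow rubric, reported separately so a low score says exactly what to fix. The CriterionJudge class, rubric texts, and 1-5 scale are assumptions for illustration.

```python
# Sketch: one narrow judge per criterion instead of a single "overall quality" judge.
from dataclasses import dataclass
from typing import Callable


@dataclass
class CriterionJudge:
    name: str    # e.g. "relevance"
    rubric: str  # the one narrow thing this judge checks

    def score(self, answer: str, call_model: Callable[[str], str]) -> int:
        prompt = (f"Rubric: {self.rubric}\n\nAnswer: {answer}\n\n"
                  "Reply with a single integer from 1 to 5.")
        return int(call_model(prompt).strip()[0])


JUDGES = [
    CriterionJudge("relevance",   "The response directly addresses the question."),
    CriterionJudge("factuality",  "Every claim is supported by the provided context."),
    CriterionJudge("conciseness", "The response avoids repetition and filler."),
]


def score_all(answer: str, call_model: Callable[[str], str]) -> dict[str, int]:
    """Report scores per criterion, so a failure says what to fix."""
    return {j.name: j.score(answer, call_model) for j in JUDGES}


print(score_all("The refund takes 5-7 business days.", call_model=lambda p: "4"))
# {'relevance': 4, 'factuality': 4, 'conciseness': 4}
```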

    The best results come from combining top-down requirements, such as regulatory constraints and stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down judge for correctness but discovered through data analysis that correct responses almost always cited the top two retrieval results. This insight became a new production-friendly judge that could proxy for correctness without requiring ground-truth labels.
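    Here is a sketch of what such a production-friendly proxy judge might look like: a cheap, deterministic check that the response cites the top two retrieval results, the pattern that correlated with correctness. The "[doc_id]" citation convention is an assumption; the article does not describe the customer's response format.

```python
# Sketch: proxy judge that checks citations instead of requiring ground-truth labels.
def cites_top_retrievals(response: str, retrieved_doc_ids: list[str],
                         top_k: int = 2) -> bool:
    """Pass if each of the top-k retrieved documents is cited in the response."""
    return all(f"[{doc_id}]" in response for doc_id in retrieved_doc_ids[:top_k])


# Usage: cheap to run in production, no ground-truth label required.
print(cites_top_retrievals(
    response="Per the refund policy [doc_17] and the fee schedule [doc_03], ...",
    retrieved_doc_ids=["doc_17", "doc_03", "doc_41"],
))  # True
```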

    Lesson three: You need fewer examples than you think. Teams can create robust judges from just 20-30 well-chosen examples. The key is selecting edge cases that expose disagreement rather than obvious examples where everyone agrees.

    "We're able to run this process with some teams in as little as three hours, so it doesn't really take that long to start getting a good judge," Koppol said.

    Production results: From pilots to seven-figure deployments

    Frankle shared three metrics Databricks uses to measure Judge Builder's success: whether customers want to use it again, whether they increase AI spending and whether they progress further in their AI journey.

    On the first metric, one customer created more than a dozen judges after their initial workshop. "This customer made more than a dozen judges after we walked them through doing this in a rigorous way for the first time with this framework," Frankle said. "They really went to town on judges and are now measuring everything."

    For the second metric, the business impact is clear. "There are multiple customers who have gone through this workshop and have become seven-figure spenders on GenAI at Databricks in a way that they weren't before," Frankle said.

    The third metric reveals Judge Builder's strategic value. Customers who previously hesitated to use advanced techniques like reinforcement learning now feel confident deploying them because they can measure whether improvements actually occurred.

    "There are customers who have gone and done very advanced things after having had these judges where they were reluctant to do so before," Frankle said. "They've moved from doing a little bit of prompt engineering to doing reinforcement learning with us. Why spend the money on reinforcement learning, and why spend the energy on reinforcement learning if you don't know whether it actually made a difference?"

    What enterprises should do now

    The teams successfully moving AI from pilot to production treat judges not as one-time artifacts but as evolving assets that grow with their systems.

    Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement plus one observed failure mode. These become your initial judge portfolio.

    Second, create lightweight workflows with subject matter experts. A few hours reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batched annotation and inter-rater reliability checks to denoise your data.

    Third, schedule regular judge reviews using production data. New failure modes will emerge as your system evolves. Your judge portfolio should evolve with them.
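    A small sketch of what such a recurring review could check: re-measure the judge's gap to fresh expert labels drawn from production traffic and flag the judge for recalibration when the gap grows past a threshold. The threshold and data shapes are illustrative assumptions.

```python
# Sketch: periodic judge review against newly labeled production samples.
def needs_recalibration(judge_scores: list[int], fresh_expert_scores: list[int],
                        max_mean_gap: float = 0.75) -> bool:
    """Flag the judge when its average gap to fresh expert labels exceeds the threshold."""
    gaps = [abs(j - e) for j, e in zip(judge_scores, fresh_expert_scores)]
    return sum(gaps) / len(gaps) > max_mean_gap


print(needs_recalibration([5, 4, 2, 1], [3, 4, 4, 1]))  # True: mean gap is 1.0
```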

    "A judge is a way to evaluate a model, it's also a way to create guardrails, it's also a way to have a metric against which you can do prompt optimization and it's also a way to have a metric against which you can do reinforcement learning," Frankle said. "Once you have a judge that you know represents your human taste in an empirical form that you can query as much as you want, you can use it in 10,000 different ways to measure or improve your agents."



