At Cevro AI we build AI agents for player support. Part of that work involves something people don't think about much: building AI that evaluates AI. Automated QA that can score hundreds of thousands of conversations across brands, languages, and channels.
You learn things doing this. Here are a few (proceed with mild caution).
Language Models Are Translators, Not Calculators
Ask any LLM: "Is 23 less than 20?" Correct answer, every time.
Now bury that question in 2,000 tokens of iGaming business logic. Eligibility rules, tiered thresholds, date comparisons, conditional branches based on player segment. Suddenly you get wrong answers. Not always, but often enough to ruin your afternoon.
This isn't the model being stupid. It's us asking it to do the wrong kind of work.
Here's what's actually going on: language models predict the next token based on patterns learned from text. They're exceptional at tasks that look like language (understanding intent, maintaining context, generating responses that fit the vibe). Arithmetic buried in prose is a different beast entirely. The model isn't "calculating" 23 < 20. It's pattern-matching against training examples where numbers appeared in similar contexts. When the surrounding noise gets complex enough, those patterns get unreliable.
Plato once said "know thyself, and also know thy tools." (Don't fact-check that.) Understanding what happened in a conversation? Let the model cook. Checking whether a number exceeds a threshold? Don't make it guess.
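To make that concrete, here's a minimal sketch of the division of labor. The model does the language work (pulling numbers out of prose) and plain code does the arithmetic. `call_llm` is a stand-in for whatever completion client you use, and the JSON shape is made up for the example:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your completion client of choice."""
    raise NotImplementedError

def check_threshold(conversation: str, rules: str) -> bool:
    # Language work: the model translates 2,000 tokens of business
    # prose into two numbers. This is the task it's actually good at.
    prompt = (
        "Given these rules and this conversation, return JSON only, "
        'shaped like {"player_value": <number>, "threshold": <number>}.\n\n'
        f"Rules:\n{rules}\n\nConversation:\n{conversation}"
    )
    extracted = json.loads(call_llm(prompt))

    # Logic work: the comparison runs in code, where 23 < 20 is
    # false every single time, no matter how noisy the context was.
    return extracted["player_value"] < extracted["threshold"]
```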
Implicit Logic Is the Enemy
iGaming support runs on business rules that look simple:
"A player qualifies for a promotional credit if they haven't received one this week, their account is in good standing, and their activity ratio meets the tier threshold."
Three conditions. Should be straightforward. (Narrator voice: it was not straightforward.)
"This week" in which timezone? "Good standing" checks how many flags across how many systems? "Tier threshold" varies by segment, currency, region. What looked like three conditions is actually a decision tree with a dozen branches, and you've just asked your scorer to figure it all out from context clues.
Natural language hides complexity. That's the whole point of natural language. But hidden complexity kills reliable evaluation. You need every branch visible, every comparison explicit.
We've come to appreciate making things tediously, almost comically obvious. A system that shows all its work is a system you can debug when something goes wrong. And something always goes wrong.
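Here's what "tediously obvious" looks like in practice: a sketch of the promo-credit rule with every branch spelled out. The player fields, the timezone, and the threshold values are all illustrative, but the shape is the point. Each check leaves a reason behind, so a surprising result traces back to the exact condition that fired:

```python
from dataclasses import dataclass
from datetime import datetime
from zoneinfo import ZoneInfo

@dataclass
class Player:
    segment: str
    currency: str
    last_credit_at: datetime | None
    compliance_flags: int
    risk_flags: int
    activity_ratio: float

# Illustrative values; the real table varies by segment, currency, region.
TIER_THRESHOLDS = {("vip", "EUR"): 0.8, ("standard", "EUR"): 0.5}

def credit_eligibility(player: Player, now: datetime) -> tuple[bool, list[str]]:
    reasons: list[str] = []

    # "This week": pinned to a named timezone and an ISO week,
    # never left to context clues.
    tz = ZoneInfo("Europe/Malta")
    if player.last_credit_at is not None:
        same_week = (
            player.last_credit_at.astimezone(tz).isocalendar()[:2]
            == now.astimezone(tz).isocalendar()[:2]
        )
        if same_week:
            reasons.append("already received a credit this ISO week (Europe/Malta)")

    # "Good standing": which flags, from which systems, spelled out.
    if player.compliance_flags > 0:
        reasons.append(f"{player.compliance_flags} open compliance flag(s)")
    if player.risk_flags > 0:
        reasons.append(f"{player.risk_flags} open risk flag(s)")

    # "Tier threshold": looked up explicitly, never inferred.
    threshold = TIER_THRESHOLDS.get((player.segment, player.currency))
    if threshold is None:
        reasons.append(f"no threshold defined for ({player.segment}, {player.currency})")
    elif player.activity_ratio < threshold:
        reasons.append(f"activity ratio {player.activity_ratio} below {threshold}")

    return (not reasons, reasons)
```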
Time Matters More Than You'd Think
When do you evaluate a conversation?
Scoring happens later. Sometimes minutes after a chat ends, sometimes hours. And in iGaming, player state changes fast. Deposits land. Withdrawals process. Account flags flip.
Here's the trap: if you pull current data to judge a past decision, you get weird results. Say an agent tells a player their withdrawal is pending verification at 2:15 PM. Because at 2:15 PM, it was. Scoring runs at 3:00 PM. By then, the withdrawal has gone through. Now the scorer sees a completed withdrawal and flags the agent for saying it was pending.
The agent was right. Penalized for something that changed after the conversation ended.
You have to reason about what the world looked like when the conversation happened, not when you're scoring it. Sounds obvious when you say it out loud. Somehow easy to miss when you're in the weeds.
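One way to avoid the trap is to reconstruct player state from an event log, replaying only what had happened by the time of the conversation. A toy sketch, assuming an append-only list of (timestamp, field, value) events:

```python
from datetime import datetime, timezone

def state_as_of(events: list[tuple[datetime, str, str]], as_of: datetime) -> dict:
    """Replay only the events that had happened by `as_of`."""
    state: dict[str, str] = {}
    for ts, field, value in sorted(events):
        if ts > as_of:
            break  # everything past here happened after the conversation
        state[field] = value
    return state

events = [
    (datetime(2025, 1, 7, 14, 10, tzinfo=timezone.utc), "withdrawal_status", "pending"),
    (datetime(2025, 1, 7, 14, 40, tzinfo=timezone.utc), "withdrawal_status", "completed"),
]

# Scoring at 3:00 PM against a 2:15 PM conversation still sees "pending":
snapshot = state_as_of(events, datetime(2025, 1, 7, 14, 15, tzinfo=timezone.utc))
assert snapshot == {"withdrawal_status": "pending"}  # the agent was right
```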
QA Isn't an Afterthought
There's a natural temptation to treat quality scoring as something you bolt on later. Build the agent, get it responding, then figure out how to measure whether it's any good.
The problem is: if your QA can't reliably tell correct from incorrect, you can't improve. You iterate, you ship changes, you watch metrics move, but you're not actually learning anything. The feedback loop is broken. You optimize for signals that look good on a dashboard but don't correlate with players actually getting helped.
In iGaming, this matters more than in most domains. Players are asking about their money. Their withdrawals. Their account status. Their losses. A wrong answer isn't just a bad experience, it's a trust violation. Sometimes it's a compliance issue. The bar for "good enough" is higher, and the cost of being wrong is real.
QA that actually works isn't a nice-to-have. It's half the product.
Know What You Don't Know
The most useful output from a scoring system isn't always a score. Sometimes data is missing. Sometimes the business rule is genuinely ambiguous. Sometimes a conversation falls outside anything you've mapped.
A system that always produces a confident number, no matter what, will mislead you. Better to say "I can't evaluate this one reliably, here's why." That tells you where your rules have gaps, where your data has holes, where your agent is handling situations you haven't thought through yet.
Confidence is easy to generate. Calibrated confidence is hard. Knowing the edges of what you can actually measure is more useful than pretending you can measure everything.
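In practice, that means the scorer's output type has to make room for "no score." A sketch, with hypothetical reason codes; the abstentions are the point, because aggregated they map exactly where your rules and data have gaps:

```python
from dataclasses import dataclass
from enum import Enum

class Abstained(Enum):
    MISSING_DATA = "required player data unavailable at scoring time"
    AMBIGUOUS_RULE = "business rule does not cover this case"
    UNMAPPED_SCENARIO = "conversation falls outside known scenarios"

@dataclass
class ScoreResult:
    score: float | None              # None means "don't trust a number here"
    abstained: Abstained | None = None
    detail: str = ""

def score_conversation(conversation: dict) -> ScoreResult:
    # Refuse to guess when the inputs can't support a judgment.
    if "player_state" not in conversation:
        return ScoreResult(None, Abstained.MISSING_DATA,
                           "no player snapshot attached to this conversation")
    # ... actual evaluation would go here ...
    return ScoreResult(score=0.92)
```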
Why We Care
Every player should get VIP treatment. That means every conversation, whether a human or AI handles it, should hit the same bar. Reliable scoring is what makes that possible when you're operating at scale.
iGaming player support is our bread and butter. The eligibility logic, the regional compliance quirks, the way frustrated players actually talk when money is involved. Our QA reflects that.
If you're building AI for player support, or trying to figure out how to maintain quality as volume grows, we should chat. The hard problems are the interesting ones.
Learn more fun tidbits at Cevro Academy, or follow our blog for more AI insights.
About Mendel Blesofsky
Mendel is a Co-Founder and CTO at Cevro AI. He leads the technical vision for Cevro, developing advanced AI agents that are transforming the customer support landscape.