AI Evaluation

How to evaluate LLMs and AI agents

8 min read · Guide by humaineeti

An AI agent is only as good as its evaluations. Without them, drift, hallucinations, and quality regressions surface in production — in front of users, auditors, and regulators. This guide covers the metrics that matter for LLMs and agents, the role of LLM-as-judge, and how to make evaluation continuous rather than a one-off.

Why LLM evaluation is different

Traditional software has deterministic tests; LLMs are probabilistic and open-ended, so "correct" is rarely a single string. Agents compound the difficulty: they produce traces with many spans — tool calls, retrievals, planning steps — and useful evaluation means scoring each span, not just the final answer. That is why generic accuracy checks fall short.
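As a rough illustration of span-level scoring, the sketch below assumes a trace is simply a list of spans, each with a kind and a reference value, and that each kind gets its own scorer. The Span schema, the scorer functions, and the sample trace are all hypothetical and not tied to any particular tracing library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Span:
    """One step in an agent trace (hypothetical schema)."""
    kind: str      # e.g. "retrieval", "tool_call", "answer"
    output: str
    expected: str  # reference value for this step, if one exists


# Toy scorers: each returns a score in [0, 1] for one span.
def exact_match(span: Span) -> float:
    return 1.0 if span.output.strip() == span.expected.strip() else 0.0


def contains_expected(span: Span) -> float:
    return 1.0 if span.expected.lower() in span.output.lower() else 0.0


SCORERS: Dict[str, Callable[[Span], float]] = {
    "tool_call": exact_match,        # tool name and args must match exactly
    "retrieval": contains_expected,  # retrieved text must include the key fact
    "answer": contains_expected,     # open-ended text: lenient containment
}


def score_trace(spans: List[Span]) -> List[float]:
    """Score every span with the scorer registered for its kind."""
    return [SCORERS[s.kind](s) for s in spans]


# Illustrative trace: the tool call and retrieval are fine, the answer is not.
trace = [
    Span("tool_call", "search(query='refund policy')", "search(query='refund policy')"),
    Span("retrieval", "Refunds are issued within 30 days.", "30 days"),
    Span("answer", "You can get a refund within two weeks.", "30 days"),
]
span_scores = score_trace(trace)
print(span_scores)  # [1.0, 1.0, 0.0]
```

Scoring only the final answer would report a single failure here; the per-span view shows that the tool call and retrieval succeeded and the generation step is where the error entered.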

The RAG triad

For retrieval-augmented generation, three metrics catch the most common failure modes. Context relevance checks whether the retriever fetched the right material; faithfulness checks whether every claim in the answer is grounded in that retrieved context (the primary defence against hallucination); and answer relevance checks whether the response actually addresses the user's question. Evaluate the retriever and the generator separately so you know which to fix.

  • Context relevance (often measured as context precision and recall): did retrieval surface the right context?
  • Faithfulness: is every claim supported by the retrieved context?
  • Answer relevance: does the answer address the question?
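The retrieval and faithfulness checks above can be sketched in a few lines. This is a simplified version under stated assumptions: retrieval is scored with set-based precision and recall over chunk IDs that a human has labelled as relevant, and faithfulness is the fraction of answer claims a verifier accepts. The function names and the word-overlap verifier are hypothetical stand-ins; a real pipeline would typically use an LLM judge or an entailment model as the verifier.

```python
from typing import Callable, List, Set, Tuple


def context_precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall over retrieved chunk IDs."""
    unique = set(retrieved)
    hits = sum(1 for chunk_id in unique if chunk_id in relevant)
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def faithfulness(claims: List[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of answer claims that the verifier marks as supported."""
    if not claims:
        return 1.0  # assumption: no claims means nothing is unsupported
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_supported(claim: str, context: str) -> bool:
    """Stand-in verifier: every word of the claim appears in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in claim.lower().split())


# Illustrative data: 4 chunks retrieved, 3 chunks labelled relevant.
precision, recall = context_precision_recall(["c1", "c2", "c7", "c9"], {"c1", "c2", "c3"})
faith = faithfulness(
    ["refunds take 30 days", "refunds are free"],
    "refunds take 30 days after approval",
    naive_supported,
)
print(precision, round(recall, 3), faith)  # 0.5 0.667 0.5
```

Keeping the retrieval scores and the faithfulness score separate is what lets you tell a retriever problem (low recall) from a generator problem (low faithfulness on good context).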

Evaluating agents, safety, and operations

Beyond RAG, agent workflows need trajectory correctness, tool-call effectiveness, step efficiency, and recovery on failure. Safety-critical systems add hallucination rate, toxicity and PII-leak detection, prompt-injection robustness, and refusal correctness. And operations matter too: latency percentiles, cost per query, drift, and success rate determine whether a good model is actually shippable.
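A few of these agent and operational metrics are simple enough to sketch directly. The example assumes a trajectory is recorded as an ordered list of tool names and compared against a reference trajectory; the helper names, the strict-match definition of trajectory correctness, and all sample numbers are illustrative assumptions rather than a standard.

```python
import math
from typing import List


def trajectory_match(actual: List[str], expected: List[str]) -> bool:
    """Strict trajectory correctness: same tools in the same order."""
    return actual == expected


def step_efficiency(actual: List[str], expected: List[str]) -> float:
    """Ratio of reference steps to steps taken, capped at 1.0."""
    if not actual:
        return 0.0
    return min(1.0, len(expected) / len(actual))


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


expected_tools = ["search", "fetch", "summarize"]
actual_tools = ["search", "search", "fetch", "summarize"]  # one redundant search

# Made-up per-query measurements for illustration.
latencies_ms = [120, 150, 180, 200, 240, 300, 310, 450, 900, 2000]
costs_usd = [0.002, 0.004, 0.003, 0.011]

print(trajectory_match(actual_tools, expected_tools))  # False
print(step_efficiency(actual_tools, expected_tools))   # 0.75
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))  # 240 2000
mean_cost = sum(costs_usd) / len(costs_usd)  # mean cost per query in USD
```

The gap between the median and the 95th percentile in the sample latencies is the reason to report percentiles rather than an average: a small share of slow queries can dominate user experience.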

LLM-as-judge and human-in-the-loop

Because reference answers are scarce, LLM-as-judge (prompting a strong model to score outputs against explicit criteria) has become the workhorse of modern evaluation, and for metrics like faithfulness it can correlate well with human judgement. The judge is itself a model and can be wrong, so the strongest setups combine automated judging with human labeling on a sample, and add custom scorers for domain-specific bars such as financial accuracy or medical safety.
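A minimal sketch of that pairing follows, assuming the judge is any callable that takes a prompt string and returns text. The prompt wording, the 1 to 5 scale, the function names, and the stub judge are all hypothetical; in practice judge_model would wrap a call to whichever model API you use.

```python
import re
from typing import Callable, List, Optional

JUDGE_PROMPT = """You are grading an answer for faithfulness to the context.
Context: {context}
Answer: {answer}
Reply with "Score: N" where N is an integer from 1 (unsupported) to 5 (fully supported)."""


def judge_score(judge_model: Callable[[str], str], context: str, answer: str) -> Optional[int]:
    """Ask the judge for a 1-5 score; return None if the reply cannot be parsed."""
    reply = judge_model(JUDGE_PROMPT.format(context=context, answer=answer))
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def agreement_rate(judge_scores: List[int], human_scores: List[int], tolerance: int = 1) -> float:
    """Share of sampled items where judge and human differ by at most `tolerance`."""
    pairs = list(zip(judge_scores, human_scores))
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)


def fake_judge(prompt: str) -> str:
    """Stub standing in for a real model call, so the sketch runs offline."""
    return "Score: 4"


judged = judge_score(fake_judge, "Refunds take 30 days.", "Refunds take about a month.")
# Made-up judge and human scores for a 4-item labelled sample.
rate = agreement_rate([4, 5, 2, 1], [4, 3, 2, 2])
print(judged, rate)  # 4 0.75
```

Tracking the agreement rate on the human-labelled sample over time gives an early signal when the judge prompt or the judge model needs recalibrating.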

Make it continuous, not a one-off

Evaluation should run across the whole AI lifecycle: offline evals during development, CI/CD gates that fail a build on score regression, A/B comparison in production, and continuous monitoring that alerts on drift. This closed loop — trace, verify, score, retrain — turns evaluation from a launch checkbox into a flywheel that catches problems before users do.
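The CI/CD gate can be as small as a comparison between the current run's scores and a stored baseline. The sketch below assumes higher scores are better and that a drop of more than 0.02 counts as a regression; the function name, the threshold, and the score values are illustrative assumptions.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float], current: Dict[str, float],
                     max_drop: float = 0.02) -> List[str]:
    """Return metrics whose score fell more than `max_drop` below baseline.
    A metric missing from the current run counts as a regression."""
    failures = []
    for metric, base in baseline.items():
        score = current.get(metric)
        if score is None or base - score > max_drop:
            failures.append(metric)
    return failures


# Made-up scores for illustration.
baseline_scores = {"faithfulness": 0.91, "answer_relevance": 0.88, "context_recall": 0.80}
current_scores = {"faithfulness": 0.85, "answer_relevance": 0.88, "context_recall": 0.81}

regressions = find_regressions(baseline_scores, current_scores)
print(regressions)  # ['faithfulness']
# In a CI job, a non-empty list would map to a non-zero exit code,
# e.g. sys.exit(1 if regressions else 0), which fails the build.
```

The same comparison can run on a schedule against production samples, which turns the build gate into a drift alert.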

FAQ

Common questions

What is the RAG triad?

The RAG triad is three metrics — context relevance, faithfulness, and answer relevance — that together catch the most common failure modes in retrieval-augmented generation by checking the retriever and generator separately.

What is LLM-as-judge?

LLM-as-judge uses a strong language model to score another model's outputs against defined criteria. It scales evaluation when reference answers are scarce and, for metrics like faithfulness, can correlate well with human judgement. It works best when paired with human review on a sample.

How do you evaluate AI agents, not just single answers?

Agents emit traces with multiple spans, so you score each step: trajectory correctness, tool-call effectiveness, step efficiency, and recovery on failure — alongside safety and operational metrics like latency, cost, and drift.
