@kamran-rapidfireAI This is just my mental model, since there are a lot of different opinions on evaluations; I usually make full use of LangSmith for them.
My Mental Model: Trace → Dataset → Evaluator → Experiment → Regression
Step 1: Triage the Trace: Classify Why It Failed, Not Just That It Failed
When I see a bad trace, I pin down the exact failure category before anything else:
- Retrieval miss → check the retriever span: right docs returned? What were the scores?
- Bad tool call → check the tool span: what args did the model pass? Was the schema ambiguous?
- Citation failure → check the final LLM span: did it have the right context but ignore it?
- Chunking/reranking off → compare what chunks arrived vs. what actually made it into the response.
I tag the trace immediately (`retrieval-miss`, `bad-tool-args`) so I can filter and group later.
Observability concepts & spans · Add metadata & tags to traces
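The triage step can be sketched as a small classifier over span data. This is purely illustrative: the span names and fields below are assumptions for the sketch, not LangSmith's run schema.

```python
def classify_failure(spans: dict) -> str:
    """Map a bad trace's span data to a failure tag (illustrative fields)."""
    retriever = spans.get("retriever", {})
    tool = spans.get("tool", {})
    llm = spans.get("llm", {})
    # Retrieval miss: no relevant docs came back from the retriever.
    if retriever and not retriever.get("relevant_docs"):
        return "retrieval-miss"
    # Bad tool call: the model passed args that don't match the tool schema.
    if tool and not tool.get("args_valid", True):
        return "bad-tool-args"
    # Citation failure: the right context arrived but was never cited.
    if llm.get("had_context") and not llm.get("cited_sources"):
        return "citation-failure"
    return "unclassified"
```

Even a crude classifier like this keeps the tagging consistent, which is what makes the later filtering and grouping useful.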
Step 2: Build a Failure Dataset — Turn Anecdotes Into Benchmarks
Once I have 5–10 traces with the same failure, I select them in LangSmith → “Add to Dataset” → name it rag-citation-failures-march-2026.
Each row carries: original input, bad output, and expected output (if I know it). This is my ground truth — inputs that should break my current system, so I can measure when I’ve fixed it.
Create datasets from traces (UI) · Manage datasets programmatically · Evaluation concepts
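Conceptually, each dataset row is just the trace's I/O repackaged. A minimal sketch of that mapping (the field names here are assumptions for illustration, not the LangSmith example schema):

```python
def trace_to_example(trace: dict) -> dict:
    """Turn one bad trace into a dataset row: input, bad output, expected output."""
    return {
        "inputs": {"question": trace["input"]},
        "outputs": {"expected": trace.get("expected")},  # may be unknown yet
        "metadata": {
            "bad_output": trace["output"],  # kept for reference when reviewing
            "failure_tag": trace.get("tag", "unclassified"),
        },
    }
```

Keeping the bad output and the failure tag in metadata means I can later slice the dataset by failure mode without re-reading traces.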
Step 3: Write Targeted Evaluators — Make the Failure Measurable
I don’t use generic evaluators. I write ones that directly test the failure I saw (these are just examples):
```python
# Retrieval relevance
def retrieval_relevance(run, example):
    score = score_relevance(run.outputs["context"], example.inputs["question"])
    return {"key": "retrieval_relevance", "score": score}

# Citation enforcement
def citation_check(run, example):
    cited = any(s in run.outputs["answer"] for s in run.outputs["sources"])
    return {"key": "has_citation", "score": int(cited)}

# Tool call validity
def tool_args_valid(run, example):
    valid = validate_against_schema(run.outputs["tool_input"], example.metadata["expected_schema"])
    return {"key": "tool_args_valid", "score": int(valid)}
```
Now I have a number, not a feeling.
How to define a code evaluator (SDK) · How to define an LLM-as-judge evaluator · Return multiple scores in one evaluator
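The helpers in those evaluators (`score_relevance`, `validate_against_schema`) are stand-ins for whatever scoring logic fits your app. As one hypothetical example, a toy pure-Python version of the schema check might look like this (a real setup would use a proper JSON Schema validator):

```python
def validate_against_schema(args: dict, schema: dict) -> bool:
    """Check tool args against a tiny schema: required keys plus Python types.

    `schema` maps field name -> expected type; a trailing "?" on the name
    marks the field optional. A toy stand-in for a real JSON Schema validator.
    """
    for field, expected_type in schema.items():
        optional = field.endswith("?")
        name = field.rstrip("?")
        if name not in args:
            if optional:
                continue
            return False  # required field missing
        if not isinstance(args[name], expected_type):
            return False  # wrong type
    return True
```

The point is that the evaluator returns a hard 0/1 per example, so "the model sometimes passes bad args" becomes a measurable rate.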
Step 4: Run Experiments: One Variable at a Time
I use evaluate() and change one thing per run. No shotgun sweeps.
```python
from langsmith import evaluate

evaluate(
    chain_with_chunk_512,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-512-baseline",
)

evaluate(
    chain_with_chunk_256,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-256-experiment",
)
```
LangSmith shows both side-by-side with per-example score deltas — I can see exactly which failure cases got fixed, and which didn’t.
I iterate the same way for: top_k, reranker on/off, prompt wording, tool schema rewrites.
How to evaluate an LLM application · Compare experiment results · Analyze an experiment
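The per-example delta view LangSmith renders is also easy to reproduce offline when I want to script it. A sketch of the logic, assuming both runs scored the same example IDs (the dict-of-scores shape here is an assumption for the sketch, not an SDK return type):

```python
def score_deltas(baseline: dict, experiment: dict) -> dict:
    """Per-example score delta between two experiment runs.

    Both args map example_id -> score. Positive delta means the experiment
    improved that example; negative means it regressed.
    """
    return {
        example_id: experiment[example_id] - baseline[example_id]
        for example_id in baseline
        if example_id in experiment
    }
```

Looking at deltas per example, rather than just the aggregate, is what tells me whether a change fixed the hard cases or merely shuffled which examples fail.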
Step 5: A/B in Production: Validate on Real Traffic
Once a config wins offline, I shadow-test ~15% of real traffic before full rollout. Both versions log to LangSmith with metadata={"variant": "v1"} / {"variant": "v2"}. After a day, I filter by variant and compare evaluator scores. This catches distribution shift my failure dataset didn’t cover.
Filter traces in the application · Set up LLM-as-a-judge online evaluators · Set up automation rules
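Routing ~15% of traffic to the challenger can be done with a deterministic hash, so a given user always lands in the same arm. This split mechanism lives in my app, not in LangSmith; the function below is an illustrative sketch:

```python
import hashlib

def assign_variant(user_id: str, challenger_pct: float = 0.15) -> str:
    """Deterministically bucket a user into v1 (control) or v2 (challenger)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "v2" if bucket < challenger_pct else "v1"
```

The returned string is what goes into the run's `metadata={"variant": ...}`, so both arms can be filtered and scored against each other later.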
Step 6: Lock It as a Regression Test: Never Regress Silently (Again just an example)
```python
# tests/test_regressions.py
def test_no_citation_regressions():
    results = evaluate(
        production_chain,
        data="rag-citation-failures-march-2026",
        evaluators=[citation_check],
    )
    avg_score = results.to_pandas()["has_citation"].mean()
    assert avg_score >= 0.90, f"Citation quality dropped: {avg_score}"
```
Every future PR runs against my known hard cases automatically.
How to run evaluations with pytest · CI/CD pipeline example
The Summary Table
| Stage | What I’m doing | LangSmith feature |
| --- | --- | --- |
| Trace triage | Classify why it failed | Spans, metadata tags |
| Dataset | Turn failure cases into a benchmark | Datasets |
| Evaluators | Make the failure mode measurable | Code / LLM-as-judge evaluators |
| Experiments | Test one hypothesis at a time | evaluate(), experiment comparison |
| A/B | Validate on real traffic | Metadata filtering, online eval |
| Regression suite | Prevent silent regressions | pytest + CI/CD integration |
The core principle I adhere to: a trace you close without adding to a dataset is a learning opportunity permanently lost. The whole loop only works when failures become data points.