Trace-to-Fix: how are you actually improving RAG/agents after observability flags issues?

I’ve been looking at the agent/LLM observability space lately (LangSmith, Arize, Braintrust, Datadog LLM Observability, etc.). Traces are great at showing what failed and where it failed.

What I’m still curious about is the step after that:

How do you go from “I see the failure in the trace” to “I found the fix” in a repeatable way?

Examples of trace-level issues I mean:

  • Retrieval returns low-quality context or misses key docs

  • Citation enforcement fails or the model does not cite what it uses

  • Tool calls have bad parameters or the agent picks the wrong tool

  • Reranking or chunking choices look off in hindsight

Do you:

  • Write custom scripts to sweep params (chunk size, top-k, rerankers, prompts, tool policies)?

  • Add failing traces to a dataset and run experiments?

  • A/B prompts in production?

  • Maintain a regression suite of traces?

  • Something else?

Would love to hear the practical workflow people are actually using.

@kamran-rapidfireAI This is just my mental model, since opinions on evaluation vary a lot; I lean heavily on LangSmith for evaluations.

My Mental Model: Trace → Dataset → Evaluator → Experiment → Regression


Step 1: Triage the Trace, Classify Why, Not Just That

When I see a bad trace, I pin down the exact failure category before anything else:

  • Retrieval miss → check the retriever span: right docs returned? What were the scores?
  • Bad tool call → check the tool span: what args did the model pass? Was the schema ambiguous?
  • Citation failure → check the final LLM span: did it have the right context but ignore it?
  • Chunking/reranking off → what chunks arrived vs. what actually made it into the response?

I tag the trace immediately (e.g. retrieval-miss, bad-tool-args) so I can filter and group later.
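To keep the categories consistent, I run the same rules every time rather than eyeballing each trace. A minimal sketch of that triage logic, assuming spans were exported as plain dicts; the field names and thresholds here are my own illustration, not a LangSmith schema:

```python
# Hypothetical triage helper: map a trace's spans to a failure tag.
# Span dicts, field names, and the 0.5 score cutoff are illustrative.
def triage(spans):
    for span in spans:
        if span["type"] == "retriever" and span.get("top_score", 1.0) < 0.5:
            return "retrieval-miss"      # right docs never came back
        if span["type"] == "tool" and not span.get("args_valid", True):
            return "bad-tool-args"       # model passed malformed parameters
        if span["type"] == "llm" and not span.get("has_citation", True):
            return "citation-failure"    # context was there, model ignored it
    return "unclassified"

trace = [
    {"type": "retriever", "top_score": 0.31},
    {"type": "llm", "has_citation": False},
]
print(triage(trace))  # retrieval-miss (first matching rule wins)
```

The point is that the same trace always gets the same tag, so filtering and grouping later actually mean something.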

Observability concepts & spans · Add metadata & tags to traces


Step 2: Build a Failure Dataset — Turn Anecdotes Into Benchmarks

Once I have 5–10 traces with the same failure, I select them in LangSmith → “Add to Dataset” → name it rag-citation-failures-march-2026.

Each row carries: original input, bad output, and expected output (if I know it). This is my ground truth — inputs that should break my current system, so I can measure when I’ve fixed it.
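Programmatically, building those rows is just reshaping the exported traces; a sketch assuming the traces come back as dicts (the trace keys and the helper name are mine, and the commented-out upload calls are the real LangSmith client methods):

```python
def to_dataset_rows(failing_traces):
    """Reshape exported failing traces into dataset rows.

    Trace dict keys here are illustrative; adapt to your export format.
    """
    inputs, outputs = [], []
    for t in failing_traces:
        inputs.append({"question": t["input"]})
        outputs.append({
            "bad_output": t["output"],      # what the system actually said
            "expected": t.get("expected"),  # ground truth, if I know it
        })
    return inputs, outputs

inputs, outputs = to_dataset_rows([
    {"input": "What is our refund policy?", "output": "No idea.", "expected": "30 days."},
])
# Then upload with the LangSmith client, e.g.:
# client = langsmith.Client()
# ds = client.create_dataset("rag-citation-failures-march-2026")
# client.create_examples(inputs=inputs, outputs=outputs, dataset_id=ds.id)
```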

Create datasets from traces (UI) · Manage datasets programmatically · Evaluation concepts


Step 3: Write Targeted Evaluators — Make the Failure Measurable

I don’t use generic evaluators. I write ones that directly test the failure I saw (these are just examples):

# Retrieval relevance (score_relevance is my own helper, embedding- or LLM-based)
def retrieval_relevance(run, example):
    score = score_relevance(run.outputs["context"], example.inputs["question"])
    return {"key": "retrieval_relevance", "score": score}

# Citation enforcement: does the answer actually reference any retrieved source?
def citation_check(run, example):
    cited = any(s in run.outputs["answer"] for s in run.outputs["sources"])
    return {"key": "has_citation", "score": int(cited)}

# Tool call validity (validate_against_schema is my own JSON-schema helper)
def tool_args_valid(run, example):
    valid = validate_against_schema(run.outputs["tool_input"], example.metadata["expected_schema"])
    return {"key": "tool_args_valid", "score": int(valid)}

Now I have a number, not a feeling.

How to define a code evaluator (SDK) · How to define an LLM-as-judge evaluator · Return multiple scores in one evaluator


Step 4: Run Experiments: One Variable at a Time

I use evaluate() and change one thing per run. No shotgun sweeps.

from langsmith import evaluate

# Baseline: current chain, 512-token chunks
evaluate(
    chain_with_chunk_512,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-512-baseline",
)

# Candidate: identical chain, 256-token chunks (the single changed variable)
evaluate(
    chain_with_chunk_256,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-256-experiment",
)

LangSmith shows both side-by-side with per-example score deltas — I can see exactly which failure cases got fixed, and which didn’t.
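That side-by-side view boils down to per-example deltas. A local sketch of the same comparison, assuming each experiment's scores were pulled into a dict keyed by example id (the names and data are mine):

```python
def score_deltas(baseline, experiment):
    """Per-example score change between two experiments.

    baseline/experiment: {example_id: score} dicts exported from each
    experiment's results. Positive delta means the example improved.
    """
    return {ex_id: experiment[ex_id] - baseline[ex_id]
            for ex_id in baseline if ex_id in experiment}

deltas = score_deltas(
    {"ex1": 0.0, "ex2": 1.0, "ex3": 0.0},  # chunk-512 baseline
    {"ex1": 1.0, "ex2": 1.0, "ex3": 0.0},  # chunk-256 experiment
)
fixed = [k for k, d in deltas.items() if d > 0]
print(fixed)  # ['ex1']: the one failure case the change actually fixed
```

Looking at which examples moved, rather than only the average, is what tells you whether a change fixed the failure mode or just shuffled scores around.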

I iterate the same way for: top_k, reranker on/off, prompt wording, tool schema rewrites.
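"One variable at a time" can be made mechanical: start from the baseline config and emit one variant per changed knob. A sketch, with config keys of my own invention:

```python
BASELINE = {"chunk_size": 512, "top_k": 4, "reranker": False}

def one_factor_variants(baseline, sweeps):
    """Yield (name, config) pairs differing from baseline in exactly one knob."""
    for knob, values in sweeps.items():
        for v in values:
            if v != baseline[knob]:
                yield f"{knob}-{v}", {**baseline, knob: v}

variants = dict(one_factor_variants(BASELINE, {
    "chunk_size": [256, 512],  # 512 equals the baseline, so it is skipped
    "reranker": [True],
}))
print(sorted(variants))  # ['chunk_size-256', 'reranker-True']
# Each variant then becomes one evaluate(...) run, with the variant
# name as its experiment_prefix.
```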

How to evaluate an LLM application · Compare experiment results · Analyze an experiment


Step 5: A/B in Production: Validate on Real Traffic

Once a config wins offline, I shadow-test ~15% of real traffic before full rollout. Both versions log to LangSmith with metadata={"variant": "v1"} / {"variant": "v2"}. After a day, I filter by variant and compare evaluator scores. This catches distribution shift my failure dataset didn’t cover.
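For the split itself I hash the user/session id, so a given user stays on one variant across requests. A minimal sketch (the 15% threshold matches the shadow share above; the helper name is mine):

```python
import hashlib

def assign_variant(user_id: str, shadow_pct: float = 0.15) -> str:
    """Deterministically route roughly shadow_pct of users to v2."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "v2" if bucket < shadow_pct else "v1"

# Same user always lands in the same bucket:
assert assign_variant("user-42") == assign_variant("user-42")
# Log the assignment on each trace, e.g. metadata={"variant": assign_variant(uid)}
```

Hashing instead of random sampling keeps each user's experience consistent and makes the variant reproducible when you revisit a trace later.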

Filter traces in the application · Set up LLM-as-a-judge online evaluators · Set up automation rules


Step 6: Lock It as a Regression Test: Never Regress Silently (again, just an example)

# tests/test_regressions.py
from langsmith import evaluate

def test_no_citation_regressions():
    results = evaluate(
        production_chain,  # the current production pipeline
        data="rag-citation-failures-march-2026",
        evaluators=[citation_check],
    )
    # Evaluator scores land in "feedback.<key>" columns of the results frame
    avg_score = results.to_pandas()["feedback.has_citation"].mean()
    assert avg_score >= 0.90, f"Citation quality dropped: {avg_score:.2f}"

Every future PR runs against my known hard cases automatically.

How to run evaluations with pytest · CI/CD pipeline example


The Summary Table

| Stage | What I’m doing | LangSmith feature |
| --- | --- | --- |
| Trace triage | Classify why it failed | Spans, metadata tags |
| Dataset | Turn failure cases into a benchmark | Datasets |
| Evaluators | Make the failure mode measurable | Code / LLM-as-judge evaluators |
| Experiments | Test one hypothesis at a time | evaluate(), experiment comparison |
| A/B | Validate on real traffic | Metadata filtering, online eval |
| Regression suite | Prevent silent regressions | pytest + CI/CD integration |

The core principle I stick to: a trace you close without adding it to a dataset is a learning opportunity permanently lost. The whole loop only works when failures become data points.