Trace-to-Fix: how are you actually improving RAG/agents after observability flags issues?

I’ve been looking at the agent/LLM observability space lately (LangSmith, Arize, Braintrust, Datadog LLM Observability, etc.). Traces are great at showing what failed and where it failed.

What I’m still curious about is the step after that:

How do you go from “I see the failure in the trace” to “I found the fix” in a repeatable way?

Examples of trace-level issues I mean:

  • Retrieval returns low-quality context or misses key docs

  • Citation enforcement fails or the model does not cite what it uses

  • Tool calls have bad parameters or the agent picks the wrong tool

  • Reranking or chunking choices look off in hindsight

Do you:

  • Write custom scripts to sweep params (chunk size, top-k, rerankers, prompts, tool policies)?

  • Add failing traces to a dataset and run experiments?

  • A/B prompts in production?

  • Maintain a regression suite of traces?

  • Something else?

Would love to hear the practical workflow people are actually using.

@kamran-rapidfireAI This is just my mental model, since opinions on evaluation vary a lot; I lean heavily on LangSmith for evaluations.

My Mental Model: Trace → Dataset → Evaluator → Experiment → Regression


Step 1: Triage the Trace, Classify Why, Not Just That

When I see a bad trace, I pin down the exact failure category before anything else:

  • Retrieval miss → check the retriever span: right docs returned? What were the scores?
  • Bad tool call → check the tool span: what args did the model pass? Was the schema ambiguous?
  • Citation failure → check the final LLM span: did it have the right context but ignore it?
  • Chunking/reranking off → what chunks arrived vs. what actually made it into the response?

I tag the trace immediately (retrieval-miss, bad-tool-args) so I can filter and group later.
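To make triage consistent, the span checks above can even be sketched as code. This is a hypothetical helper with a made-up span shape (not the LangSmith run schema); the names `suggest_failure_tag`, `scores`, `schema_error`, and the 0.5 cutoff are all my own illustrative choices:

```python
# Hypothetical sketch: auto-suggest a triage tag from a trace's spans.
# The span dicts here are a made-up shape, not the LangSmith run schema.

def suggest_failure_tag(spans: list[dict]) -> str:
    """Map simple span-level signals to one of my triage tags."""
    by_name = {s["name"]: s for s in spans}

    retriever = by_name.get("retriever")
    if retriever and max(retriever.get("scores", [0.0])) < 0.5:
        return "retrieval-miss"  # nothing relevant came back

    tool = by_name.get("tool")
    if tool and tool.get("schema_error"):
        return "bad-tool-args"  # model passed args the schema rejected

    llm = by_name.get("llm")
    if llm and llm.get("context_used") and not llm.get("citations"):
        return "citation-failure"  # had the context, didn't cite it

    return "needs-manual-triage"
```

Anything the heuristics can't classify falls through to manual review, which is where most of the interesting failures live anyway.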

Observability concepts & spans · Add metadata & tags to traces


Step 2: Build a Failure Dataset — Turn Anecdotes Into Benchmarks

Once I have 5–10 traces with the same failure, I select them in LangSmith → “Add to Dataset” → name it rag-citation-failures-march-2026.

Each row carries: original input, bad output, and expected output (if I know it). This is my ground truth — inputs that should break my current system, so I can measure when I’ve fixed it.

Create datasets from traces (UI) · Manage datasets programmatically · Evaluation concepts


Step 3: Write Targeted Evaluators — Make the Failure Measurable

I don’t use generic evaluators. I write ones that directly test the failure I saw (these are just examples):

# Retrieval relevance (score_relevance is a placeholder for your own scorer)
def retrieval_relevance(run, example):
    score = score_relevance(run.outputs["context"], example.inputs["question"])
    return {"key": "retrieval_relevance", "score": score}

# Citation enforcement: did the answer quote at least one retrieved source?
def citation_check(run, example):
    cited = any(s in run.outputs["answer"] for s in run.outputs["sources"])
    return {"key": "has_citation", "score": int(cited)}

# Tool call validity (validate_against_schema is a placeholder for your validator)
def tool_args_valid(run, example):
    valid = validate_against_schema(run.outputs["tool_input"], example.metadata["expected_schema"])
    return {"key": "tool_args_valid", "score": int(valid)}

Now I have a number, not a feeling.
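One habit worth adopting: smoke-test an evaluator on a stubbed run before wiring it into evaluate(). Here `SimpleNamespace` stands in for the SDK's run object, and the answer/sources values are made up:

```python
from types import SimpleNamespace

# citation_check from above, repeated here so the snippet is self-contained.
def citation_check(run, example):
    cited = any(s in run.outputs["answer"] for s in run.outputs["sources"])
    return {"key": "has_citation", "score": int(cited)}

# SimpleNamespace stands in for the SDK's run object; values are made up.
run = SimpleNamespace(outputs={
    "answer": "Per doc-42, the limit is 10 req/s.",
    "sources": ["doc-42", "doc-7"],
})
print(citation_check(run, example=None))  # {'key': 'has_citation', 'score': 1}
```

A thirty-second check like this catches key-name typos before you burn an experiment run on them.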

How to define a code evaluator (SDK) · How to define an LLM-as-judge evaluator · Return multiple scores in one evaluator


Step 4: Run Experiments: One Variable at a Time

I use evaluate() and change one thing per run. No shotgun sweeps.

from langsmith import evaluate

evaluate(
    chain_with_chunk_512,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-512-baseline",
)

evaluate(
    chain_with_chunk_256,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-256-experiment",
)

LangSmith shows both side-by-side with per-example score deltas — I can see exactly which failure cases got fixed, and which didn’t.

I iterate the same way for: top_k, reranker on/off, prompt wording, tool schema rewrites.
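The "one variable at a time" discipline can be mechanized. A small sketch, with illustrative parameter names not tied to any framework: generate one config per changed value, everything else held at baseline, and feed each config's prefix into evaluate():

```python
# Sketch: enumerate one-variable-at-a-time experiment configs from a baseline.
# Parameter names here are illustrative, not tied to any framework.

BASELINE = {"chunk_size": 512, "top_k": 4, "reranker": False}

def single_variable_runs(baseline: dict, sweeps: dict) -> list[dict]:
    """One experiment per changed value; everything else stays at baseline."""
    runs = []
    for param, values in sweeps.items():
        for value in values:
            if value == baseline[param]:
                continue  # skip the baseline itself
            cfg = {**baseline, param: value}
            cfg["experiment_prefix"] = f"{param}-{value}"
            runs.append(cfg)
    return runs

runs = single_variable_runs(BASELINE, {"chunk_size": [256, 512], "top_k": [4, 8]})
# each run differs from BASELINE in exactly one parameter
```

If a score moves, you know exactly which knob moved it; that attribution is the whole point of avoiding shotgun sweeps.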

How to evaluate an LLM application · Compare experiment results · Analyze an experiment


Step 5: A/B in Production: Validate on Real Traffic

Once a config wins offline, I shadow-test ~15% of real traffic before full rollout. Both versions log to LangSmith with metadata={"variant": "v1"} / {"variant": "v2"}. After a day, I filter by variant and compare evaluator scores. This catches distribution shift my failure dataset didn’t cover.
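The ~15% split works best when it's deterministic per user, so the same user always lands on the same variant. A minimal sketch (the fraction and variant names are illustrative):

```python
import hashlib

# Sketch of the ~15% shadow split: hash the user id into 100 buckets so
# routing is deterministic per user. Fraction and names are illustrative.

def assign_variant(user_id: str, v2_fraction: float = 0.15) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < v2_fraction * 100 else "v1"

# Tag every trace with the result so it can be filtered later, e.g.
# metadata={"variant": assign_variant(user_id)}
```

Hash-based routing also means you can re-derive which variant any past trace saw from the user id alone, even if a metadata tag got dropped.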

Filter traces in the application · Set up LLM-as-a-judge online evaluators · Set up automation rules


Step 6: Lock It In as a Regression Test: Never Regress Silently (again, just an example)

# tests/test_regressions.py
from langsmith import evaluate

def test_no_citation_regressions():
    results = evaluate(
        production_chain,
        data="rag-citation-failures-march-2026",
        evaluators=[citation_check],
    )
    df = results.to_pandas()
    # depending on SDK version the score column may be prefixed,
    # e.g. "feedback.has_citation" — check your results frame
    avg_score = df["feedback.has_citation"].mean()
    assert avg_score >= 0.90, f"Citation quality dropped: {avg_score:.2f}"

Every future PR runs against my known hard cases automatically.

How to run evaluations with pytest · CI/CD pipeline example


The Summary Table

| Stage | What I'm doing | LangSmith feature |
|---|---|---|
| Trace triage | Classify why it failed | Spans, metadata tags |
| Dataset | Turn failure cases into a benchmark | Datasets |
| Evaluators | Make the failure mode measurable | Code / LLM-as-judge evaluators |
| Experiments | Test one hypothesis at a time | evaluate(), experiment comparison |
| A/B | Validate on real traffic | Metadata filtering, online eval |
| Regression suite | Prevent silent regressions | pytest + CI/CD integration |

The core principle I adhere to: a trace you close without adding to a dataset is a learning opportunity permanently lost. The whole loop only works when failures become data points.

The trace-to-fix gap is real and most observability tools stop at the wrong layer. The pattern we see most often: LangSmith shows a retrieval step returned low-relevance chunks. The instinct is to tune the retriever — adjust top-k, try a reranker, switch embedding models. Three weeks later, marginal improvement.

The actual fix is usually upstream. The chunks themselves are the problem — incomplete context, low semantic density, orphaned fragments that score well on similarity but contain almost no useful information.

The workflow that actually closes the trace-to-fix loop:

  1. When a trace flags a bad retrieval, pull the specific chunks that were returned.
  2. Score them independently for completeness and semantic density — not retrieval relevance, but intrinsic quality.
  3. If they score below 0.6 on completeness, the retriever did its job correctly. It retrieved the most similar content. The content was just bad.
  4. Fix the chunking strategy or re-chunk those documents before re-embedding.

The observability gap isn’t in the tracing; it’s that traces measure what happened, not whether the underlying data was worth retrieving in the first place. Adding a pre-embedding quality gate closes that loop faster than any retrieval tuning will.
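A minimal sketch of what such a gate could look like. The scoring heuristics and the 0.6 threshold here are illustrative placeholders, not a standard metric; real gates would use an LLM judge or embedding-based density scores:

```python
# Sketch of a pre-embedding quality gate. The scoring heuristic and the 0.6
# threshold are illustrative placeholders, not a standard metric.

def completeness_score(chunk: str) -> float:
    """Crude proxy: penalize very short chunks and dangling fragments."""
    words = chunk.split()
    length_ok = min(len(words) / 50, 1.0)  # ramps up to 1.0 at ~50 words
    ends_cleanly = 1.0 if chunk.rstrip().endswith((".", "?", "!")) else 0.5
    return length_ok * ends_cleanly

def quality_gate(chunks: list[str], threshold: float = 0.6) -> list[str]:
    """Only embed chunks that pass the gate; the rest go back for re-chunking."""
    return [c for c in chunks if completeness_score(c) >= threshold]
```

The point isn't this particular heuristic; it's that the gate runs before embedding, so orphaned fragments never enter the index and never get the chance to win on similarity while losing on substance.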