Trouble understanding and editing experiment summary evaluator feedback

Hi!
I’m running into two related issues with experiment feedback in LangSmith and wanted to report them.

Issue context

I ran an experiment with a summary evaluator multiple times (had errors during the runs, so I needed to rerun failed examples). Because of this, the summary evaluator calculated feedback on partial data multiple times (particularly, I’m calculating precision, recall, f1 score, etc.). Therefore, now in the UI, I see wrong final numbers. Here is what I see in the UI:


When re-calculating the numbers myself based on the runs, I have precision 0.24, recall 0.26, f1-score 0.25, mcc 0.18.
Now I understand that having a summary evaluator after each segmented run was a sloppy decision, and I should only run it after I have all the needed examples run successfully. But I wanted to fix my old inconsistent results in the UI, so I performed an investigation on my own, and here are my findings.
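
For readers following along, a recalculation of this kind can be sketched as below. It is only a minimal sketch: it assumes the per-run outcomes can be reduced to pooled true/false positive/negative counts over all successful runs, and the function name and arguments are illustrative, not the evaluator actually used in the experiment.

```python
import math


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1 and MCC from pooled confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # MCC is defined as 0 here when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}
```

Because these metrics are ratios of pooled counts, the mean of several values computed on partial subsets is generally not equal to the value computed once on the full set, which is consistent with the UI numbers drifting from the recalculated ones.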

Inconsistent averages between the /feedback endpoint and get_experiment_results

I figured that I can get all the experiment feedback through GET /feedback?session={experiment.id}. But when I fetch feedback via this endpoint and compute the average manually, I get a different result than the avg reported in feedback_stats from get_experiment_results and in the UI.

# Fetching raw feedback
experiment_feedbacks = client.request_with_retries(
    "GET",
    f"/feedback?session={experiment.id}",
).json()

# Fetching feedback stats
experiment_feedback_stats = client.get_experiment_results(experiment.name)["feedback_stats"]

# Comparing averages for the same experiment
key = "precision"
key_stats = [f for f in experiment_feedbacks if f["key"] == key]
key_scores = [f["score"] for f in key_stats]
average_from_feedbacks = sum(key_scores) / len(key_scores)  # 0.1724
average_from_stats = experiment_feedback_stats[key]["avg"]  # 0.26782

Average precision from feedbacks: 0.17240, from stats: 0.26782, from the UI: 0.27.
Count of feedbacks in feedbacks: 1, Count from stats: 5

Could you clarify what set of feedback records the feedback_stats aggregation is based on, and how the denominator is determined? It looks like the two may be operating on different record sets, since both the averages and the counts differ.

500 Internal Server Error when calling delete_feedback

Since I want to delete the feedback that was calculated on partial results, I figured that I could try to delete a feedback record using its ID from the GET response. However, I get a 500:

client.delete_feedback(feedback_id=key_stats[0]["id"])
# LangSmithAPIError: Server error (500) --- DELETE /feedback/...
# {"detail":"Internal server error"}

The ID comes directly from the /feedback?session=... response so it should be valid. Is deletion of experiment-level feedback supported? If not, is there another way to remove or correct feedback records that were created before a full dataset run was completed?

Environment: EU region, LangSmith SDK 0.7.38, Python 3.10.19, macOS-26.0.1-x86_64-i386-64bit

Thanks in advance!

Hey, thanks for the detailed writeup. These are two separate issues worth addressing individually.

On the inconsistent averages:

The confusion here comes down to two different feedback scopes in LangSmith. When you call get_experiment_results, the feedback_stats field aggregates run-level feedback, one score per dataset example, from row-level evaluators. Summary evaluators are different: they compute a single score across the entire experiment and land in a separate field called session_feedback_stats, not feedback_stats.

This is documented in two places:

  • The How to fetch performance metrics for an experiment guide shows the full experiment payload, which includes both feedback_stats (run-level) and session_feedback_stats (session-level) as distinct fields.
  • The Summary evaluators docs confirm that summary evaluators “compute metrics across an entire experiment rather than individual examples”, meaning their output is inherently session-scoped, not run-scoped.

So when you fetch /feedback?session={experiment.id} and manually average, you’re looking at session-level records (summary evaluator output, n=1), while feedback_stats is counting run-level records across all your partial reruns (n=5). These are completely different sets with different denominators, which explains the discrepancy.

To read summary evaluator results correctly:

resp = client.read_project(project_name=experiment.name, include_stats=True)
print(resp.session_feedback_stats)  # summary evaluator feedback lives here

On the 500 error from delete_feedback:

This is a server-side limitation. The Feedback data format docs specify that every feedback record is tied to a run_id:

“run_id — Unique identifier for a specific run within a session”

Summary evaluator feedback is attached to a session-level entity, not a regular run, so delete_feedback, which expects a standard run-backed record, fails with a 500. The ID is valid; the API just doesn’t support this operation. There is no documented deletion path for session-level feedback records.

Your best workaround, based on the Log user feedback using the SDK docs, is to overwrite the stale values by submitting corrected feedback with the same key:

stale_record = [f for f in experiment_feedbacks if f["key"] == "precision"][0]

client.create_feedback(
    run_id=stale_record["run_id"],
    key="precision",
    score=0.24,  # your recalculated correct value
)

The UI surfaces the most recent feedback for a given key, so this effectively corrects what you’re seeing without a hard delete.

If you want a full wipe, the data purging docs note that deleting a trace cascades to delete all associated feedback, but that means losing all your run data too, which is likely too destructive in your case.

For prevention going forward, and you already know this, only trigger summary evaluators once the full dataset run is complete. The evaluate() SDK docs support a resume pattern for rerunning failed examples without firing evaluators on partial data.

I’d also recommend filing a bug for the 500 on delete_feedback: a valid feedback ID from GET causing a server error on DELETE is worth tracking on LangSmith’s end.

I hope this helps. Let me know if you have more questions.

Hi,

Thanks for the quick response! Good suggestion, but not everything lines up.

  1. The run-level vs session-level split doesn’t explain the numbers I’m seeing.

I don’t have any run-level precision feedback at all. My run-level evaluators only produce accuracy, fn, fp, etc.; precision, recall, and the other aggregate metrics are summary feedback.

When I inspect what /feedback?session={experiment.id} returns for one experiment (left-rabbit-12), grouped by key:

accuracy:        16 entries, mean 0.88
fn:              15 entries, mean 0.07
fp:              16 entries, mean 0.06
precision:        1 entry,   mean 0.17
recall:           1 entry,   mean 0.17

This experiment has 1903 successful runs, not 15 or 16. So whatever /feedback?session= is returning, it’s neither the full set of run-level feedback (which would be ~1903 records) nor purely session-level (which would be 1 or 5, assuming I had 5 partial evaluations). It’s some third thing, and it’s unclear where the single summary-feedback entry comes from, or what the 15/16 counts represent.
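
A per-key breakdown like the one above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming each feedback record is a dict with "key" and "score" fields as in the /feedback response; summarize_by_key is a hypothetical helper name, not part of the SDK.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_key(feedbacks: list[dict]) -> dict[str, tuple[int, float]]:
    """Return {key: (count, mean score)} for records that carry a numeric score."""
    scores = defaultdict(list)
    for fb in feedbacks:
        # Skip records without a score (e.g. comment-only feedback).
        if fb.get("score") is not None:
            scores[fb["key"]].append(fb["score"])
    return {key: (len(vals), mean(vals)) for key, vals in scores.items()}
```

Running it on whatever list the endpoint returned makes it easy to compare the counts per key against the number of runs in the experiment.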

  2. read_project(..., include_stats=True) only gives aggregates.

session_feedback_stats from read_project confirms precision is session-scoped (as you said), but it only exposes the averages — not the individual feedback records that went into them. So I still don’t have a way to list the partial summary feedbacks I need to clean up.

  3. create_feedback does not overwrite.

In my testing, posting a new feedback with the same key (and same run_id, or with a session reference) creates an additional record rather than replacing the previous one. The UI/feedback_stats then shows an average across all of them, including the stale partial-run entries. So overwriting isn’t a workaround in practice — the bad values stay in the aggregate.


Given all that, the original questions still stand and I would appreciate help:

  • What exactly is /feedback?session={experiment.id} returning? It’s clearly neither all run-level feedback nor only summary feedback — the 15/16-entry counts for run-level keys don’t match the 1903 runs in the experiment.
  • How do I list the individual summary-level feedback records for a session so I can delete the stale ones?
  • How do I delete (or genuinely overwrite) a session-level feedback record so the aggregate reflects only the correct run?

Thanks!

Hey, thanks for the detailed follow-up. You’re right to push back, so let me address each point properly.

On what /feedback?session={experiment.id} is actually returning:

The counts of 15/16 for your run-level keys are a pagination artifact, not the full set. The REST API returns a default page of results, not all 1903. To get everything, you need to paginate through:

all_feedbacks = []
offset = 0
limit = 100

while True:
    page = client.request_with_retries(
        "GET",
        "/feedback",
        params={"session": experiment.id, "limit": limit, "offset": offset},
    ).json()
    if not page:
        break
    all_feedbacks.extend(page)
    offset += len(page)
    if len(page) < limit:
        break

As documented in the REST API guide, the /feedback endpoint accepts limit and offset params. Without them you’re just seeing the first page. The single entry for precision/recall is expected: those are your session-level summary records, of which only one partial-run version is being surfaced per page.
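
If it helps, the same offset/limit loop can be factored into a small generator that takes the page-fetching call as a parameter, so the paging logic can be checked without hitting the API. This is only a sketch: iter_pages and fetch_page are hypothetical names, and it assumes the endpoint returns a plain list per page.

```python
from typing import Callable, Iterator


def iter_pages(
    fetch_page: Callable[[int, int], list[dict]], limit: int = 100
) -> Iterator[dict]:
    """Yield records from an offset/limit endpoint until a short or empty page."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:
            break
        offset += len(page)


# Possible usage against the feedback endpoint (assumes `client` and `experiment`):
# all_feedbacks = list(iter_pages(
#     lambda offset, limit: client.request_with_retries(
#         "GET", "/feedback",
#         params={"session": experiment.id, "limit": limit, "offset": offset},
#     ).json()
# ))
```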


On listing individual summary-level feedback records:

session_feedback_stats from read_project only gives you aggregates, you’re right. To get the actual individual records for a specific key, filter by key on the paginated feedback call:

summary_feedbacks = [
    f for f in all_feedbacks
    if f["key"] == "precision"
]
# Inspect id, score, created_at for each partial-run entry
for f in summary_feedbacks:
    print(f["id"], f["score"], f["created_at"])

This will show you all the stale partial-run summary records with their IDs and timestamps, which is what you need to identify the ones to clean up.
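
To separate the record you want to keep from the ones to clean up, one option is to order the records by created_at. This is a minimal sketch under the assumption that the newest record is the correct one, which only holds if the final summary run was the last to write feedback; split_latest is a hypothetical helper, and created_at is assumed to be an ISO 8601 string.

```python
from datetime import datetime


def _parse_ts(value: str) -> datetime:
    # Python 3.10's fromisoformat() rejects a trailing "Z", so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def split_latest(records: list[dict]) -> tuple[dict, list[dict]]:
    """Return (newest record, older records), ordered by created_at."""
    if not records:
        raise ValueError("no feedback records to split")
    ordered = sorted(records, key=lambda r: _parse_ts(r["created_at"]))
    return ordered[-1], ordered[:-1]
```

The second element of the result is the list of candidate stale records, oldest first.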


On create_feedback not overwriting:

You’re correct; this was wrong in my previous answer. Per the feedback data format docs, each feedback record is a unique entry with its own id. Posting a new one with the same key creates an additional record; the aggregate then averages all of them, including the stale ones. There is no documented upsert behavior.
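
To make the effect concrete, here is a small worked example using the numbers reported earlier in this thread (avg 0.26782 over 5 records, recalculated precision 0.24). It assumes the aggregate is a plain arithmetic mean over all records for the key, which matches the observed behavior but is not confirmed by the docs; mean_after_append is a hypothetical helper.

```python
def mean_after_append(current_avg: float, current_n: int, new_score: float) -> float:
    """Mean of a key after one more record is added to current_n existing ones."""
    return (current_avg * current_n + new_score) / (current_n + 1)


# avg 0.26782 over 5 records, plus one corrected record with score 0.24
print(round(mean_after_append(0.26782, 5, 0.24), 4))  # 0.2632
```

Under that assumption, adding one correct record only moves the displayed value from about 0.268 to about 0.263, nowhere near 0.24, so the stale entries still dominate.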


The real fix for the stale summary evaluator state:

The cleanest path forward is to use the evaluate-existing-experiment approach: run your summary evaluator only on the final complete experiment, using cached traces, so no application re-execution is needed:

from langsmith import evaluate

evaluate(
    "left-rabbit-12",  # your experiment name or ID
    summary_evaluators=[your_summary_evaluator],  # summary evaluators go in their own argument
)

This adds a new correct summary feedback record on top of the stale ones. It won’t delete the old records, but at least you get a correct entry in there.

For the actual deletion of the stale records, there is genuinely no supported path in the current API. The 500 on delete_feedback for session-level records is a server-side bug (the ID is valid, the endpoint just doesn’t handle session-scoped feedback). The data purging docs only document cascade-deletes via trace deletion, which is too destructive in your case.

I’d strongly recommend opening a support ticket directly with LangSmith (EU region) requesting:

  1. A fix for the 500 on delete_feedback for session-level feedback IDs
  2. A supported way to list and delete individual summary evaluator feedback records

Going forward — how to avoid this entirely:

The retry failed runs guide shows exactly the right pattern for your use case: run without summary evaluators first, retry only failed examples into the same experiment, then run the summary evaluator once everything is complete:

# Step 1: Run without summary evaluators, ignore errors
results = await client.aevaluate(
    target,
    data="your-dataset",
    evaluators=[row_level_evaluators],  # NO summary evaluators here
    error_handling='ignore'
)

# Step 2: Identify and retry failed examples into the same experiment
runs = client.list_runs(project_name=results.experiment_name, is_root=True)  # root runs only
successful_ids = {r.reference_example_id for r in runs}  # set for fast membership checks
failed_examples = [e for e in client.list_examples(dataset_name="your-dataset")
                   if e.id not in successful_ids]

await client.aevaluate(
    target,
    failed_examples,
    evaluators=[row_level_evaluators],
    experiment=results.experiment_name,
    error_handling='ignore'
)

# Step 3: Only NOW run the summary evaluator on the complete experiment
evaluate(results.experiment_name, summary_evaluators=[your_summary_evaluator])

This way the summary evaluator fires exactly once, on the full completed dataset.