Hi!
I’m running into two related issues with experiment feedback in LangSmith and wanted to report them.
Issue context
I ran an experiment with a summary evaluator several times, because errors during the runs forced me to rerun the failed examples. As a result, the summary evaluator computed feedback (precision, recall, F1 score, etc.) on partial data multiple times, and the UI now shows incorrect final numbers.
When I recalculate the numbers myself from the runs, I get precision 0.24, recall 0.26, F1 score 0.25, and MCC 0.18.
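For reference, here is a minimal sketch of how such metrics can be recomputed from pooled counts, and why averaging scores computed on partial reruns generally does not give the same number. The function names and the example counts are hypothetical and are not taken from my actual experiment; the sketch also assumes a binary classification setup.

```python
import math


def confusion_counts(pairs):
    """Count (tp, fp, fn, tn) from (predicted, actual) boolean pairs."""
    tp = fp = fn = tn = 0
    for predicted, actual in pairs:
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and MCC; undefined ratios fall back to 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Two hypothetical partial runs, as (tp, fp, fn, tn) counts.
batch_a = (1, 1, 1, 4)
batch_b = (1, 3, 1, 4)

# Mean of the per-batch precision scores (what averaging partial feedback does).
mean_of_partials = (
    binary_metrics(*batch_a)["precision"] + binary_metrics(*batch_b)["precision"]
) / 2

# Precision computed once over the pooled counts of both batches.
pooled_counts = tuple(a + b for a, b in zip(batch_a, batch_b))
pooled = binary_metrics(*pooled_counts)
```

In this toy case the mean of the partial precision scores is 0.375, while the precision over the pooled counts is about 0.333, so any aggregate built from partial-run feedback can drift from the value computed once over all runs.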
I now understand that running a summary evaluator after each partial run was a sloppy decision, and that I should run it only after all the needed examples have run successfully. Still, I wanted to fix the old inconsistent results in the UI, so I investigated on my own. Here are my findings.
Inconsistent averages between the /feedback endpoint and get_experiment_results
I figured that I can get all the experiment feedback through GET /feedback?session={experiment.id}. But when I fetch feedback via this endpoint and compute the average manually, I get a different result than the avg reported in feedback_stats from get_experiment_results and in the UI.
from langsmith import Client

client = Client()
# `experiment` is the experiment object whose feedback is being inspected

# Fetching raw feedback
experiment_feedbacks = client.request_with_retries(
    "GET",
    f"/feedback?session={experiment.id}",
).json()
# Fetching feedback stats
experiment_feedback_stats = client.get_experiment_results(experiment.name)["feedback_stats"]
# Comparing averages for the same experiment
key = "precision"
key_stats = [f for f in experiment_feedbacks if f["key"] == key]
key_scores = [f["score"] for f in key_stats]
average_from_feedbacks = sum(key_scores) / len(key_scores)  # 0.1724
average_from_stats = experiment_feedback_stats[key]["avg"]  # 0.26782
Average precision from the raw feedback: 0.17240; from feedback_stats: 0.26782; in the UI: 0.27.
Number of precision records returned by /feedback: 1; count reported in feedback_stats: 5.
Could you clarify what set of feedback records the feedback_stats aggregation is based on, and how the denominator is determined? It looks like the two may be operating on different record sets, since both the averages and the counts differ.
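To make the comparison reproducible for every feedback key rather than just precision, a small helper along these lines could be used. This is only a sketch: summarize_feedback and diff_against_stats are hypothetical names, and it assumes that each feedback_stats entry exposes the count under "n" next to "avg", which I have not verified against the API reference.

```python
from collections import defaultdict
from statistics import fmean


def summarize_feedback(records):
    """Group raw feedback records by key and compute count and mean score."""
    scores_by_key = defaultdict(list)
    for record in records:
        score = record.get("score")
        if score is not None:  # skip records without a numeric score
            scores_by_key[record["key"]].append(score)
    return {
        key: {"n": len(scores), "avg": fmean(scores)}
        for key, scores in scores_by_key.items()
    }


def diff_against_stats(records, stats, tolerance=1e-6):
    """Return the keys whose local count or mean differs from the reported stats.

    Assumption: `stats[key]` is a dict with "n" (count) and "avg" (mean).
    """
    mismatches = {}
    for key, local in summarize_feedback(records).items():
        reported = stats.get(key, {})
        reported_n = reported.get("n")
        reported_avg = reported.get("avg")
        differs = (
            reported_n != local["n"]
            or reported_avg is None
            or abs(reported_avg - local["avg"]) > tolerance
        )
        if differs:
            mismatches[key] = {
                "local": local,
                "reported": {"n": reported_n, "avg": reported_avg},
            }
    return mismatches


# The precision numbers from this report, reduced to the relevant fields.
raw_records = [{"key": "precision", "score": 0.1724}]
reported_stats = {"precision": {"n": 5, "avg": 0.26782}}
mismatches = diff_against_stats(raw_records, reported_stats)
```

With the numbers from this report, the helper flags precision, because one raw record averages 0.1724 while the stats report five records averaging 0.26782.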
500 Internal Server Error when calling delete_feedback
Since I want to delete the feedback that was calculated on partial results, I figured that I could try to delete a feedback record using its ID from the GET response. However, I get a 500:
client.delete_feedback(feedback_id=key_stats[0]["id"])
# LangSmithAPIError: Server error (500) --- DELETE /feedback/...
# {"detail":"Internal server error"}
The ID comes directly from the /feedback?session=... response, so it should be valid. Is deletion of experiment-level feedback supported? If not, is there another way to remove or correct feedback records that were created before a full dataset run was completed?
Environment: EU region, LangSmith SDK 0.7.38, Python 3.10.19, macOS-26.0.1-x86_64-i386-64bit
Thanks in advance!
