Hi, it's all in the title: I'd be eager to build a pairwise evaluator based on previous evaluation results, rather than on a direct comparison of the outputs. For instance, I could write business rules saying that I want to favour one metric and use the other as damage control in the pairwise experiment (see the sketch at the end of this post).
def ranked_preference(inputs: dict, outputs: list[dict]) -> list:
    # Assumes example inputs have a 'question' key and experiment
    # outputs have an 'answer' key; `chain` is an LLM judge chain
    # defined elsewhere that compares the two candidate answers.
    response = chain.invoke({
        "question": inputs["question"],
        "answer_a": outputs[0].get("answer", "N/A"),
        "answer_b": outputs[1].get("answer", "N/A"),
    })
    # Map the judge's verdict to per-run scores, assuming the chain
    # returns a dict with a 'Preference' key (1 = A wins, 2 = B wins).
    if response["Preference"] == 1:
        return [1, 0]
    if response["Preference"] == 2:
        return [0, 1]
    return [0.5, 0.5]
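For context, I'm wiring this evaluator into the pairwise experiment roughly like this, via evaluate_comparative from the langsmith SDK (the experiment names below are placeholders):

from langsmith.evaluation import evaluate_comparative

# Compare two existing experiments using the evaluator above.
evaluate_comparative(
    ["experiment-a", "experiment-b"],  # placeholder experiment names
    evaluators=[ranked_preference],
)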
For instance, in this code, does the "outputs" dict carry more metadata that I could use? Or should I look at runs rather than outputs? I am not sure whether evaluation results are found in the Run object, though (Reference here).
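To make the business-rule idea concrete, here is a minimal sketch of what I'd like to write. client.list_feedback is the real SDK call, but everything else is an assumption on my part: the metric names 'correctness' and 'toxicity', the rule itself, and the signature, which assumes a pairwise evaluator can receive the two Run objects the way a regular evaluator receives a single run.

from langsmith import Client
from langsmith.schemas import Run

client = Client()

def feedback_scores(run_id) -> dict:
    # Prior evaluation results live as Feedback entries attached to a
    # run, not on the Run object itself, so fetch them via the client.
    return {f.key: f.score for f in client.list_feedback(run_ids=[run_id])}

def rule_based_preference(runs: list[Run], example) -> list:
    # Hypothetical signature: assumes the pairwise evaluator can take
    # the two Run objects from the compared experiments.
    a = feedback_scores(runs[0].id)
    b = feedback_scores(runs[1].id)

    def effective(fb: dict) -> float:
        # Business rule (hypothetical metric names): favour 'correctness',
        # with 'toxicity' as damage control, i.e. a toxic answer never wins.
        if (fb.get("toxicity") or 0) > 0.5:
            return -1.0
        return fb.get("correctness") or 0.0

    if effective(a) > effective(b):
        return [1, 0]
    if effective(b) > effective(a):
        return [0, 1]
    return [0.5, 0.5]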