We’re excited to launch a new LangSmith feature that makes it easier to build high-quality LLM-as-a-judge evaluators: Align Evals.
One big challenge we hear consistently from teams building evaluations is: “Our evaluation scores don’t match what we’d expect a human on our team to say.” Align Evals helps you calibrate your evaluators to better match human preferences.
This feature gives you a side-by-side comparison of human-graded data and LLM-generated scores, plus a playground-like interface where you can iterate on your evaluator prompt and see the evaluator’s alignment score.
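To make the “alignment score” idea concrete: you can think of it as the rate at which the evaluator’s verdict agrees with your human-graded labels on the same examples. Here’s a rough, illustrative sketch (not necessarily the exact formula LangSmith uses):

```python
# Illustrative only: "alignment" as simple agreement between human labels
# and an LLM judge's boolean verdicts on the same examples.
def alignment_score(human_labels: list[bool], judge_scores: list[bool]) -> float:
    matches = sum(h == j for h, j in zip(human_labels, judge_scores))
    return matches / len(human_labels)

# The judge agrees with the human grader on 3 of 4 examples -> 0.75
print(alignment_score([True, True, False, True], [True, False, False, True]))
```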
Thanks for sharing, @tanushree-sharma. This looks very useful. The side-by-side human vs. LLM comparison should make calibrating evaluators much more straightforward. Does Align Evals also support custom rubrics (e.g., categorical or multi-criteria evaluations), or is it limited to numeric scoring for now?
Currently, we only support boolean evaluators, and you can run this for a single evaluator at a time. We do want to add support for categorical scores as well! For multi-criteria evaluations, I’d recommend breaking those up into distinct judges; we’ve found that independent judges tend to work better in practice than evaluating multiple criteria in a single LLM-as-a-judge.
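As a rough sketch of what that can look like in code (this is not the Align Evals feature itself; the OpenAI client, model name, and input/output field names below are just placeholders), each criterion gets its own boolean judge:

```python
from openai import OpenAI

client = OpenAI()  # placeholder judge-model client; any chat model works

def make_boolean_judge(criterion: str, instructions: str):
    """Build an LLM-as-a-judge that grades a single criterion as pass/fail."""
    def judge(inputs: dict, outputs: dict) -> dict:
        prompt = (
            f"You are grading a model response on ONE criterion: {criterion}.\n"
            f"{instructions}\n\n"
            f"Question: {inputs['question']}\n"
            f"Response: {outputs['answer']}\n\n"
            "Reply with exactly PASS or FAIL."
        )
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        verdict = reply.choices[0].message.content.strip().upper()
        return {"key": criterion, "score": verdict == "PASS"}
    return judge

# One independent judge per criterion, rather than a single judge that
# scores correctness and conciseness in the same prompt.
correctness_judge = make_boolean_judge(
    "correctness", "PASS only if the response is factually accurate."
)
conciseness_judge = make_boolean_judge(
    "conciseness", "PASS only if the response contains nothing unnecessary."
)
```

Each judge returns a single boolean score, which also keeps it compatible with the boolean evaluators we support today.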
I was SO excited for this feature because I found it right after I had set up a hackier way to go through this process with our Subject Matter Expert. I was excited enough that I pushed our Platform team to upgrade our self-hosted LangSmith instance to version 0.11.45 on our staging environment so I could start playing with it for the next step of our project.
But sadly, when I went to a dataset and clicked on the Evaluation button, I saw no “Create from Labeled Dataset” option. The release notes indicate that this feature should be available in our current 0.11.45 version. Is this not true? Is there a setting we need to enable to get this feature?
Excited to have you try this out! Thanks for flagging that this isn’t showing up in self-hosted deployments; we are looking into it.
In the meantime, you’ll have to bug your Platform team one more time (sorry!) to enable it for your org. They will need to run the following query in Postgres:
update organizations set config = config || '{"enable_align_evaluators": true}' where id = '<org_id>';
I am trying to enable it via Postgres as you suggested, @tanushree-sharma.
I had to update the query a bit (the config field is json rather than jsonb, so the || operator needed a cast):
update organizations set config = (config::jsonb || '{"enable_align_evaluators": true}'::jsonb)::json where id = '<org_id>';
I can confirm the update to the config field of the organizations table was made.
However, when I go to an existing dataset with experiments, I still don’t see the option (see screenshot). We are on 0.11.45, and we restarted the backend application.