Best Practices for Running LLM-Based Tests at Scale Without Hitting Rate Limits

Hi everyone,

I’m running a large number of LLM-based tests in parallel and consistently hitting OpenAI rate limits. I’d love to hear what patterns or best practices people use to handle this at scale.

In particular, I’m curious about:

  • How do you typically manage concurrency when running many LLM-powered tests?
  • Is using multiple OpenAI API keys ever considered a valid approach, or is that generally discouraged?

More context on my setup:

I have multiple datasets representing different workflows and expected outcomes. Each dataset contains traces that:

  1. Feed inputs into a deepagents-based agent
  2. Produce a trajectory (tool calls, arguments, messages, decisions, etc.)
  3. Are evaluated by one or more LLM judges that score the trajectory against expectations
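To make the shape of the pipeline concrete, here is a minimal sketch of the three steps above. All names (`Trace`, `run_agent`, `judge_trajectory`) are placeholders I made up for illustration; the real setup calls a deepagents-based agent and an LLM judge instead of these stubs.

```python
# Hypothetical sketch of the dataset -> agent -> judge flow.
# run_agent and judge_trajectory are stand-ins, not real APIs.
from dataclasses import dataclass, field

@dataclass
class Trace:
    inputs: dict     # inputs fed into the agent
    expected: dict   # expected outcomes used by the judge

@dataclass
class Trajectory:
    tool_calls: list = field(default_factory=list)  # (tool_name, args) pairs
    messages: list = field(default_factory=list)    # generated messages

def run_agent(trace: Trace) -> Trajectory:
    # Placeholder: the real version invokes the deepagents-based agent.
    traj = Trajectory()
    traj.tool_calls.append(("search", {"query": trace.inputs.get("query", "")}))
    traj.messages.append("answer")
    return traj

def judge_trajectory(traj: Trajectory, expected: dict) -> float:
    # Placeholder: the real version asks one or more LLM judges to score
    # the trajectory; here we just check expected tools were used.
    wanted = expected.get("tools", [])
    used = [name for name, _args in traj.tool_calls]
    hits = sum(1 for tool in wanted if tool in used)
    return hits / len(wanted) if wanted else 1.0

trace = Trace(inputs={"query": "status"}, expected={"tools": ["search"]})
score = judge_trajectory(run_agent(trace), trace.expected)
print(score)  # 1.0
```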

The judges validate things like:

  • Correct tool usage and ordering
  • Arguments passed to tools
  • Content of generated messages
  • Decisions taken given specific tool outputs
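Some of these checks can be done deterministically before spending an LLM call. As an example of the "correct tool usage and ordering" check, here is a small subsequence test (my own sketch, not part of any framework) that verifies expected tools appear in the trajectory in the given relative order:

```python
def tools_in_order(tool_calls, expected_order):
    # Deterministic pre-check before the LLM judge: returns True if the
    # tools in expected_order appear in tool_calls in that relative order
    # (other tools may be interleaved).
    it = iter(name for name, _args in tool_calls)
    return all(tool in it for tool in expected_order)

calls = [("lookup", {"id": 1}), ("fetch", {}), ("summarize", {})]
print(tools_in_order(calls, ["lookup", "summarize"]))  # True
print(tools_in_order(calls, ["summarize", "lookup"]))  # False
```

Running cheap structural checks like this first, and reserving the LLM judge for message content and decision quality, also cuts down the number of API calls that count against the rate limit.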

Right now:

  • Tests are run locally with pytest
  • Tools are mocked to isolate backend dependencies
  • Scenarios Framework is used to structure the cases
  • Tests are executed in parallel, which is where rate limits start to hurt

This works well functionally, but once the test count grows, rate limiting becomes the bottleneck.
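For reference, the parallel pattern that triggers the limits looks roughly like the following; `call_judge` is a stand-in for the real OpenAI-backed judge call, and `max_workers=16` is just an illustrative value, not my actual configuration:

```python
# Minimal sketch of the parallel execution that runs into rate limits.
from concurrent.futures import ThreadPoolExecutor

def call_judge(case_id: int) -> str:
    # Stand-in for an OpenAI API call made by an LLM judge.
    return f"case-{case_id}: scored"

# Many judge calls fired concurrently: this is where 429s start appearing
# once the test count grows.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(call_judge, range(100)))

print(len(results))  # 100
```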

Thanks in advance. I would really appreciate any insights or references.