I am having a problem implementing OpenAI-level rate limiting in LangChain.js.
I have explored the LangChain documentation and found two types of rate limiting:
- Agent-level rate limiting via middleware (to keep a single agent from running up large costs): available in both JavaScript and Python.
- In-memory rate limiting: available only in Python.
I am currently on OpenAI Tier 2, and I have not found any LangChain.js module where I can configure my account’s RPM (Requests Per Minute) and TPM (Tokens Per Minute) limits and have LangChain automatically manage requests within those limits.
Currently, I am using a retry-based mechanism for handling rate limit errors, but it is not fully reliable. There is still a possibility of exhausting the maximum retry count, which can result in failures for some users.
Is there a recommended approach in LangChain.js to implement robust rate limiting that handles both RPM and TPM, rather than relying solely on retries? If so, what would be the best practice for implementing it?
I have also not found any package that wraps this.
Hi @parshvJS
What I found (checked source)
No native RPM/TPM rate limiter in LangChain.js.
libs/langchain-core/src/language_models/*.ts - no rateLimiter field. Python’s InMemoryRateLimiter (langchain_core/rate_limiters.py) has no JS port.
Big correction: even Python limiter = RPM only, NOT TPM. Its own docstring:
“does not take into account any information about the input or the output, so it cannot be used to rate limit based on the size of the request”
So the hope that Python solves TPM = wrong. Neither SDK does TPM.
What JS actually gives (all on AsyncCallerParams, accepted by every chat model constructor; see async_caller.ts):
- maxConcurrency: proactive, caps in-flight requests (an under-used lever)
- maxRetries (default 6): reactive 429 backoff, exponential with jitter. Does NOT read the Retry-After header, which is why retries get exhausted
- modelRetryMiddleware for createAgent agents: still reactive
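Since the built-in backoff is described above as ignoring Retry-After, here is a minimal sketch of a delay calculation that prefers the server's hint and only falls back to exponential backoff with jitter. The function names (parseRetryAfterMs, backoffMs, retryDelayMs) and the 1 s base / 60 s cap are assumptions for illustration, not a LangChain API; you would call this from your own retry loop around the model call.

```typescript
/** Parse a Retry-After value (seconds or HTTP-date) into milliseconds, or null if unusable. */
function parseRetryAfterMs(value: string | null | undefined, nowMs: number): number | null {
  if (value == null) return null;
  const trimmed = value.trim();
  if (trimmed === "") return null;
  // Form 1: a number of seconds, e.g. "20" (decimals are accepted leniently).
  if (/^\d+(\.\d+)?$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: an HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}

/** Exponential backoff with full jitter, used when the server gave no hint. */
function backoffMs(
  attempt: number,
  baseMs = 1000,
  capMs = 60_000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

/** Prefer the Retry-After hint; otherwise fall back to jittered backoff. */
function retryDelayMs(
  retryAfter: string | null | undefined,
  attempt: number,
  nowMs: number = Date.now(),
  rand: () => number = Math.random,
): number {
  const hinted = parseRetryAfterMs(retryAfter, nowMs);
  return hinted ?? backoffMs(attempt, 1000, 60_000, rand);
}
```

How you obtain the header depends on the error object your client throws, so that part is left out on purpose.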
Real fix (no first-party package exists): a proactive limiter, with retries as a backstop.
- bottleneck: two reservoirs, one for RPM (reservoir + reservoirRefreshInterval) and one for TPM (job weight = estimated tokens via js-tiktoken, including the reserved output maxTokens)
- keep maxRetries: 2–3 to absorb estimate drift
- multi-process/serverless: use the Bottleneck Redis backend, otherwise each instance thinks it owns the full TPM budget
- set reservoirs to ~90% of the limit
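To make the two-budget idea concrete without pulling in a dependency, here is a rough dependency-free sketch that swaps bottleneck's reservoirs for two plain token buckets: one counts requests, the other counts estimated LLM tokens. Everything here is hypothetical (DualBudgetLimiter, estimateWeight, the headroom default), and the 4-characters-per-token estimate is only a crude stand-in for a real tokenizer such as js-tiktoken.

```typescript
interface Bucket {
  capacity: number;    // per-minute budget, also the max the bucket can hold
  available: number;   // units currently available
  refillPerMs: number; // units added back per millisecond
}

function makeBucket(perMinute: number): Bucket {
  return { capacity: perMinute, available: perMinute, refillPerMs: perMinute / 60_000 };
}

class DualBudgetLimiter {
  private requests: Bucket;
  private tokens: Bucket;
  private lastMs: number;

  constructor(rpm: number, tpm: number, headroom = 0.9, nowMs: number = Date.now()) {
    // Sit below the account limit, e.g. ~90% of it.
    this.requests = makeBucket(Math.floor(rpm * headroom));
    this.tokens = makeBucket(Math.floor(tpm * headroom));
    this.lastMs = nowMs;
  }

  /** Add back whatever both buckets earned since the last call. */
  private refill(nowMs: number): void {
    if (nowMs <= this.lastMs) return;
    const elapsed = nowMs - this.lastMs;
    this.lastMs = nowMs;
    for (const b of [this.requests, this.tokens]) {
      b.available = Math.min(b.capacity, b.available + elapsed * b.refillPerMs);
    }
  }

  /** Reserve 1 request and `weight` tokens, or return false and reserve nothing. */
  tryAcquire(weight: number, nowMs: number = Date.now()): boolean {
    if (weight > this.tokens.capacity) {
      throw new Error("request is larger than the whole TPM budget");
    }
    this.refill(nowMs);
    if (this.requests.available < 1 || this.tokens.available < weight) return false;
    this.requests.available -= 1;
    this.tokens.available -= weight;
    return true;
  }
}

/** Job weight = estimated prompt tokens + reserved output tokens (maxTokens). */
function estimateWeight(prompt: string, maxOutputTokens: number): number {
  return Math.ceil(prompt.length / 4) + maxOutputTokens;
}
```

A caller would loop on tryAcquire(estimateWeight(prompt, maxTokens)) with a short sleep until it returns true, then send the request. Note that both buckets start full, so the first minute can see a burst above the steady rate; start them empty or shrink the capacity if that matters. Like the in-memory limiter, this is single-process only.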
Example:
Two parts: the limiter (port of Python), then wire it into createAgent via middleware - no core fork needed.
1. The limiter - inMemoryRateLimiter.ts
Port of langchain_core/rate_limiters.py. RPM-only token bucket. No lock (JS single-threaded). performance.now() = monotonic.
// inMemoryRateLimiter.ts
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
export abstract class BaseRateLimiter {
  /** Resolve true once a request slot is available. */
  abstract acquire(opts?: { blocking?: boolean }): Promise<boolean>;
}
export interface InMemoryRateLimiterParams {
  /** Tokens (request credits) added per second. RPM = requestsPerSecond * 60. */
  requestsPerSecond?: number;
  /** Poll interval while blocked, seconds. */
  checkEveryNSeconds?: number;
  /** Max bucket size - caps burst. Must be >= 1. */
  maxBucketSize?: number;
}
/**
 * In-memory, single-process, request-based rate limiter (token bucket).
 *
 * Limitations (same as the Python original):
 * - In-memory only - does NOT coordinate across processes/instances.
 * - Time-based on REQUESTS only. Knows nothing about prompt/response size,
 *   so it cannot enforce TPM. These "tokens" are request credits, not LLM tokens.
 */
export class InMemoryRateLimiter extends BaseRateLimiter {
  private requestsPerSecond: number;
  private checkEveryNSeconds: number;
  private maxBucketSize: number;
  private availableTokens = 0;
  private last: number | null = null;
  constructor(params: InMemoryRateLimiterParams = {}) {
    super();
    this.requestsPerSecond = params.requestsPerSecond ?? 1;
    this.checkEveryNSeconds = params.checkEveryNSeconds ?? 0.1;
    this.maxBucketSize = params.maxBucketSize ?? 1;
    if (this.maxBucketSize < 1) throw new Error("maxBucketSize must be >= 1");
  }
  /** Try to consume one credit. Non-blocking. */
  private consume(): boolean {
    const now = performance.now() / 1000; // seconds, monotonic
    if (this.last === null) this.last = now; // init without an initial burst
    const elapsed = now - this.last;
    if (elapsed * this.requestsPerSecond >= 1) {
      this.availableTokens += elapsed * this.requestsPerSecond;
      this.last = now;
    }
    this.availableTokens = Math.min(this.availableTokens, this.maxBucketSize);
    if (this.availableTokens >= 1) {
      this.availableTokens -= 1;
      return true;
    }
    return false;
  }
  async acquire({ blocking = true }: { blocking?: boolean } = {}): Promise<boolean> {
    if (!blocking) return this.consume();
    while (!this.consume()) {
      await sleep(this.checkEveryNSeconds * 1000);
    }
    return true;
  }
}
2. Wire into the agent
JS chat models have no rateLimiter constructor field (unlike Python), so wire it via a beforeModel middleware - it runs before every model call inside createAgent.
import { createAgent, createMiddleware } from "langchain";
import { ChatOpenAI } from "@langchain/openai";
import { InMemoryRateLimiter } from "./inMemoryRateLimiter.js";
// Tier-2 example: 5000 RPM = ~83.3 req/s. Sit under it (~90%).
const limiter = new InMemoryRateLimiter({
  requestsPerSecond: 75, // ~4500 RPM
  checkEveryNSeconds: 0.05,
  maxBucketSize: 75, // allow a 1s burst, then steady
});
const rateLimitMiddleware = createMiddleware({
  name: "RateLimit",
  beforeModel: async () => {
    await limiter.acquire({ blocking: true }); // gate every model call
  },
});
const agent = createAgent({
  model: new ChatOpenAI({ model: "gpt-4o", maxRetries: 3 }), // retries = backstop
  tools: [/* ... */],
  middleware: [rateLimitMiddleware],
});
await agent.invoke({ messages: [{ role: "user", content: "hi" }] });
beforeModel fires once per model step. An agent turn with tool calls = multiple model calls = multiple acquire()s, which is correct for RPM.
- RPM only, still no TPM. Big prompts can blow Tier-2 TPM while RPM is fine → keep maxRetries for the residual 429s. For true TPM use the bottleneck weighted-reservoir approach described above.
- Single process only. Multiple instances/lambdas each run their own bucket → real limit = N × your setting. Multi-instance needs a shared store (Redis).
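The N × arithmetic can be sketched as below, along with a static split that some setups use as a stopgap when no shared store is available. Both helpers (perInstanceRps, effectiveRpm) are hypothetical, and the static split is an assumption of this sketch, not something LangChain provides: it wastes budget when instances sit idle and goes wrong as soon as autoscaling changes N, so a shared Redis-backed limiter remains the more robust option.

```typescript
/** requestsPerSecond each instance may use if the account budget is split evenly. */
function perInstanceRps(accountRpm: number, instances: number, headroom = 0.9): number {
  if (!Number.isInteger(instances) || instances < 1) {
    throw new Error("instances must be a positive integer");
  }
  return (accountRpm * headroom) / 60 / instances;
}

/** Combined RPM actually sent when every instance runs its own bucket at this rate. */
function effectiveRpm(requestsPerSecondPerInstance: number, instances: number): number {
  return requestsPerSecondPerInstance * 60 * instances;
}

// 4 instances each configured with requestsPerSecond: 75 (the single-process
// setting from the example) can together send far more than the 5000 RPM limit.
const unsplit = effectiveRpm(75, 4);

// Splitting the same 5000 RPM budget at 90% across 4 instances.
const share = perInstanceRps(5000, 4);
```

Here unsplit works out to 18000 RPM, and share to 18.75 requests per second per instance.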
wrapModelCall alternative: if you also want to retry after the gate, use wrapModelCall: async (req, handler) => { await limiter.acquire(); return handler(req); } instead of beforeModel
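A generic sketch of that gate-then-retry shape follows. The Gate and Handler types are stand-ins so the snippet runs on its own; the real request and handler types come from langchain, and gatedWithRetry and maxAttempts are hypothetical names. The point it illustrates is that every attempt, including retries, spends a request credit first.

```typescript
interface Gate {
  acquire(): Promise<boolean>;
}
type Handler<Req, Res> = (req: Req) => Promise<Res>;

/** Wrap a handler so each attempt passes the gate before calling through. */
function gatedWithRetry<Req, Res>(
  gate: Gate,
  handler: Handler<Req, Res>,
  maxAttempts = 3,
): Handler<Req, Res> {
  if (maxAttempts < 1) throw new Error("maxAttempts must be >= 1");
  return async (req: Req): Promise<Res> => {
    let lastError: unknown;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      await gate.acquire(); // retries also count against RPM
      try {
        return await handler(req);
      } catch (err) {
        lastError = err; // a fuller version would retry only on 429s and back off
      }
    }
    throw lastError;
  };
}
```

Inside a middleware, the body of wrapModelCall would play the role of the returned function, with limiter as the gate.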