Pydantic to OpenAI Structured Outputs Compatibility Solution (with implementation)
I’ve developed a comprehensive solution (available as an MIT-licensed gist) to the widespread compatibility issues between Pydantic models and OpenAI’s Structured Outputs, and I believe it should be integrated into LangChain to benefit the entire community.
## The Problem

Many LangChain users hit frustrating errors when using `.with_structured_output()` with OpenAI’s strict JSON schema mode. The root cause is that OpenAI’s Structured Outputs support only a very limited subset of JSON Schema, while Pydantic generates rich, full-featured schemas. This mismatch causes failures for common patterns:

- Optional fields (`Optional[str] = None`) generate `{"type": ["string", "null"]}`, which OpenAI rejects
- Numeric constraints (`Field(ge=0, le=100)`) use the `minimum`/`maximum` keywords, which OpenAI doesn’t support
- Recursive models (tree structures, linked lists) use `$ref`, which OpenAI explicitly forbids
- Union types generate `anyOf`/`oneOf`, which aren’t supported in strict mode
- Missing or empty `additionalProperties` causes validation errors
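To see the mismatch concretely, here is a small demonstration (assuming Pydantic v2; the `Item` model is made up for illustration) of the schema shapes Pydantic emits for an optional field and a constrained field:

```python
from typing import Optional

from pydantic import BaseModel, Field

class Item(BaseModel):
    name: str
    note: Optional[str] = None           # emitted as an anyOf with a null branch
    score: int = Field(0, ge=0, le=100)  # emitted with minimum/maximum keywords

props = Item.model_json_schema()["properties"]
print(props["note"])   # contains {'anyOf': [{'type': 'string'}, {'type': 'null'}], ...}
print(props["score"])  # contains 'minimum': 0, 'maximum': 100
```

Both shapes fall outside the subset that OpenAI's strict mode accepts, so sending this schema unmodified fails.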
Currently, developers must either rewrite their Pydantic models (breaking compatibility with other providers) or manually craft OpenAI-specific schemas (error-prone and tedious).
## The Solution

I’ve created a `sanitize_for_openai_schema()` function that automatically transforms any Pydantic model into an OpenAI-compatible schema. The implementation:

- Converts optionals to OpenAI’s preferred required-plus-nullable pattern
- Detects recursive models early and fails with helpful error messages
- Migrates constraints into field descriptions for app-side validation
- Collapses unions intelligently while preserving nullability
- Fixes `additionalProperties` to always have proper typing
- Preserves field order from the original Pydantic model
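The gist has the full implementation; as a rough, stdlib-only sketch of just two of these transformations (the nullable-union collapse and the constraint-to-description migration, using a hypothetical `sanitize_node` helper, not the gist's actual code), the idea looks like:

```python
def sanitize_node(node: dict) -> dict:
    """Sketch of two sanitizer transformations; real code should deep-copy."""
    node = dict(node)

    # Collapse {"anyOf": [X, {"type": "null"}]} into X with a nullable type.
    any_of = node.pop("anyOf", None)
    if any_of is not None:
        non_null = [s for s in any_of if s.get("type") != "null"]
        if len(non_null) == 1 and len(non_null) < len(any_of):
            node.update(sanitize_node(non_null[0]))
            node["type"] = [node["type"], "null"]
        else:
            node["anyOf"] = [sanitize_node(s) for s in any_of]

    # Move unsupported numeric constraints into the description
    # so the application can still validate them after parsing.
    notes = [f"{key}={node.pop(key)}" for key in ("minimum", "maximum") if key in node]
    if notes:
        desc = node.get("description", "")
        node["description"] = (desc + " " if desc else "") + f"({', '.join(notes)})"

    # Recurse into object properties.
    if "properties" in node:
        node["properties"] = {k: sanitize_node(v) for k, v in node["properties"].items()}
    return node

field = {"anyOf": [{"type": "integer"}, {"type": "null"}], "minimum": 0, "maximum": 120}
print(sanitize_node(field))
# -> {'type': ['integer', 'null'], 'description': '(minimum=0, maximum=120)'}
```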
## Integration with LangChain

This could be integrated into LangChain in several ways:

```python
# Option 1: Automatic in with_structured_output
llm.with_structured_output(
    MyModel,
    method="json_schema",
    strict=True,
    sanitize_schema=True,  # New parameter
)

# Option 2: Standalone utility
from langchain.output_parsers.openai_tools import sanitize_for_openai_schema

schema = sanitize_for_openai_schema(MyModel)

# Option 3: Auto-detect and sanitize when strict=True
llm.with_structured_output(MyModel, method="json_schema", strict=True)
# Automatically applies sanitization when needed
```
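For Option 3, one possible shape for the auto-detection (the `needs_sanitizing` helper and keyword set below are illustrative, not LangChain API) is a recursive scan for keywords that strict mode rejects:

```python
# Hypothetical helper sketching Option 3's auto-detection: walk the
# generated JSON schema and report whether it uses any keyword that
# OpenAI's strict mode rejects (illustrative subset from the list above).

UNSUPPORTED_KEYWORDS = {"minimum", "maximum", "anyOf", "oneOf"}

def needs_sanitizing(schema: object) -> bool:
    if isinstance(schema, dict):
        if UNSUPPORTED_KEYWORDS & schema.keys():
            return True
        return any(needs_sanitizing(v) for v in schema.values())
    if isinstance(schema, list):
        return any(needs_sanitizing(v) for v in schema)
    return False

print(needs_sanitizing({"properties": {"age": {"type": "integer", "minimum": 0}}}))      # True
print(needs_sanitizing({"type": "object", "properties": {"name": {"type": "string"}}}))  # False
```

Sanitization would then be applied only when this check fires, leaving already-compatible schemas untouched.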
## Benefits for the Community

- **Zero model changes required** - existing Pydantic models work immediately
- **Cross-provider compatibility** - the same models work with OpenAI, Anthropic, and local LLMs
- **Better developer experience** - clear errors for unsupported patterns
- **Production-tested** - includes 1000+ lines of tests covering all edge cases
- **MIT licensed** - free to use even outside LangChain
## The Code

The gist includes:

- `json_schema.py` - the complete sanitizer implementation with extensive documentation
- `test_json_schema_edge_cases.py` - a comprehensive test suite
The code is MIT licensed, so anyone can use it in their projects immediately, whether or not it gets integrated into LangChain. However, I believe this would be a valuable addition to LangChain core, saving countless developers from these frustrating compatibility issues.
## Example Usage

```python
from pydantic import BaseModel, Field
from typing import Optional, List

class User(BaseModel):
    name: str
    age: Optional[int] = Field(None, ge=0, le=120)
    tags: List[str] = []
    metadata: dict = {}

# Without sanitizer: OpenAI API errors
# With sanitizer: works
schema = sanitize_for_openai_schema(User)
```
I’m happy to help if this makes it into LangChain. The implementation is production-ready and has been battle-tested with 20+ different Pydantic models. Let me know your thoughts or if you need any clarification!