Feature (Code attached): Add JSON Schema Sanitizer for OpenAI Structured Outputs Compatibility

Pydantic to OpenAI Structured Outputs Compatibility Solution (with implementation)

I’ve developed a comprehensive solution (available as a MIT-licensed gist) for the widespread compatibility issues between Pydantic models and OpenAI’s Structured Outputs that I believe should be integrated into LangChain to benefit the entire community.

The Problem

Many LangChain users encounter frustrating errors when using .with_structured_output() with OpenAI’s strict JSON schema mode. The root cause is that OpenAI’s Structured Outputs only support a very limited subset of JSON Schema, while Pydantic generates rich, full-featured schemas. This mismatch causes failures for common patterns:

  • Optional fields (Optional[str] = None) generate {"type": ["string", "null"]} which OpenAI rejects

  • Numeric constraints (Field(ge=0, le=100)) use minimum/maximum keywords that OpenAI doesn’t support

  • Recursive models (tree structures, linked lists) use $ref which OpenAI explicitly forbids

  • Union types generate anyOf/oneOf which aren’t supported in strict mode

  • Missing or empty additionalProperties cause validation errors

Currently, developers must either rewrite their Pydantic models (breaking compatibility with other providers) or manually craft OpenAI-specific schemas (error-prone and tedious).

The Solution

I’ve created a sanitize_for_openai_schema() function that automatically transforms any Pydantic model into an OpenAI-compatible schema. The implementation:

  1. Converts optionals to OpenAI’s preferred required+nullable pattern

  2. Detects recursive models early and fails with helpful error messages

  3. Migrates constraints to descriptions for app-side validation

  4. Collapses unions intelligently while preserving nullability

  5. Fixes additionalProperties to always have proper typing

  6. Preserves field order from the original Pydantic model

Integration with LangChain

This could be integrated into LangChain in several ways:

# Option 1: Automatic in with_structured_output
llm.with_structured_output(
    MyModel, 
    method="json_schema",
    strict=True,
    sanitize_schema=True  # New parameter
)

# Option 2: Standalone utility
from langchain.output_parsers.openai_tools import sanitize_for_openai_schema
schema = sanitize_for_openai_schema(MyModel)

# Option 3: Auto-detect and sanitize when strict=True
llm.with_structured_output(MyModel, method="json_schema", strict=True)
# Automatically applies sanitization when needed

Benefits for the Community

  • Zero model changes required - Existing Pydantic models work immediately

  • Cross-provider compatibility - Same models work with OpenAI, Anthropic, local LLMs

  • Better developer experience - Clear errors for unsupported patterns

  • Production-tested - Includes 1000+ lines of tests covering all edge cases

  • MIT licensed - Free to use even outside LangChain

The Code

The gist includes:

  • json_schema.py - The complete sanitizer implementation with extensive documentation

  • test_json_schema_edge_cases.py - Comprehensive test suite

The code is MIT licensed, so anyone can use it in their projects immediately, whether or not it gets integrated into LangChain. However, I believe this would be a valuable addition to LangChain core, saving countless developers from these frustrating compatibility issues.

Example Usage

from pydantic import BaseModel, Field
from typing import Optional, List

class User(BaseModel):
    name: str
    age: Optional[int] = Field(None, ge=0, le=120)
    tags: List[str] = []
    metadata: dict = {}

# Without sanitizer: OpenAI API errors
# With sanitizer: Works perfectly!
schema = sanitize_for_openai_schema(User)

I’m happy to if this makes it into LangChain code. The implementation is production-ready and has been battle-tested with 20+ different Pydantic models. Let me know your thoughts or if you need any clarification!