Hi @weipienlee
Yeah, I totally get the frustration here — this one feels more like an abstraction leak than something you’re doing wrong.
What’s really happening is:
GPT-4.1 sometimes “tries again” internally if it messes up the JSON (like truncating it). The underlying generations result actually contains those retries as extra candidates, but ainvoke() only returns the first message of the first generation — so if that first one is broken, you never see the corrected version.
So it’s not really a Pydantic problem — it’s just how ainvoke() is currently implemented.
Since you’re intentionally not using with_structured_output(strict=True) (which makes sense if you want scratchpad-style reasoning before the final structured output), here are a couple of practical ways people handle this:
1. Add a small retry on parse failure
If JSON parsing fails, just re-prompt with something like:
“Return only valid JSON matching the schema. No extra text.”
It’s simple, but surprisingly reliable in production — and you still keep your reasoning-first approach.
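A rough sketch of what that retry loop can look like. Note `call_llm` here is a hypothetical stand-in for your actual async model call (it fakes one truncated response, then a valid one, just to show the flow):

```python
import json

_attempts = {"n": 0}

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for your real LLM call. It returns truncated
    # JSON on the first attempt, then valid JSON -- simulating the model
    # "fixing itself" on the re-prompt.
    _attempts["n"] += 1
    if _attempts["n"] == 1:
        return '{"answer": "42"'  # truncated JSON
    return '{"answer": "42"}'

RETRY_SUFFIX = "\nReturn only valid JSON matching the schema. No extra text."

def invoke_with_json_retry(prompt: str, max_retries: int = 2) -> dict:
    """Call the model; on a JSON parse failure, re-prompt with a stricter instruction."""
    text = call_llm(prompt)
    for _ in range(max_retries):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            text = call_llm(prompt + RETRY_SUFFIX)
    # Final attempt: let the parse error propagate if it still fails.
    return json.loads(text)

print(invoke_with_json_retry("Summarize the result as JSON."))  # {'answer': '42'}
```

The nice part is the scratchpad reasoning in the first attempt is untouched; the strict instruction only kicks in when parsing actually fails.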
2. Return the last generation instead of the first
If you’re comfortable subclassing, you can override ainvoke() to return:

```python
llm_result.generations[0][-1].message
```

instead of [0][0].
In cases where the model internally corrected itself, that usually gives you the fixed version without switching everything to agenerate().
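To make the indexing concrete, here’s a minimal sketch with stand-in classes mimicking the shape of LangChain’s LLMResult (in a real subclass you’d get `llm_result` from `agenerate()` inside your `ainvoke()` override — the stand-ins are just for illustration):

```python
from dataclasses import dataclass

@dataclass
class Generation:
    # Stand-in for a ChatGeneration; only the message field matters here.
    message: str

@dataclass
class LLMResult:
    # generations[prompt_index][candidate_index]
    generations: list

def last_message(llm_result: LLMResult) -> str:
    # [0][-1]: last candidate for the first prompt -- the model's
    # self-corrected attempt, if it retried internally.
    return llm_result.generations[0][-1].message

# First candidate is truncated; the retry appended a fixed version.
result = LLMResult(generations=[[Generation('{"a": 1'), Generation('{"a": 1}')]])
print(last_message(result))  # {"a": 1}
```

If the model didn’t retry, `[-1]` and `[0]` are the same element, so this is a safe default.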
3. Structured reasoning field (if you ever reconsider)
You can keep reasoning by explicitly including it in your Pydantic model:
```python
class Output(BaseModel):
    reasoning: str
    answer: FinalSchema
```
Then instruct the model to think in reasoning and put the final structured output in answer.
That preserves reflection but still constrains the final output shape.
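For example, parsing a raw response into that model might look like this (assuming Pydantic v2, with a hypothetical `FinalSchema` filled in for illustration):

```python
from pydantic import BaseModel

class FinalSchema(BaseModel):
    # Hypothetical final schema -- replace with your own fields.
    answer: str

class Output(BaseModel):
    reasoning: str
    answer: FinalSchema

raw = '{"reasoning": "The user asked for X, so...", "answer": {"answer": "X"}}'
parsed = Output.model_validate_json(raw)  # Pydantic v2 API
print(parsed.answer.answer)  # X
```

Downstream code can then drop `parsed.reasoning` and keep only the validated `answer`.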
If I were optimizing for stability while keeping scratchpad reasoning, I’d probably combine 1 and 2: return the last generation, with a parse-failure retry as the fallback.
That tends to eliminate most truncation issues without making the architecture feel clunky.