Truncated JSON & Hidden Generations - GPT-4.1 + PydanticOutputParser

Using PydanticOutputParser with GPT-4.1, I keep hitting truncated JSON issues—turns out, the model sometimes generates extra messages (generations) after a failure, but the ainvoke API only returns one. Switching to the generations API just for this feels clunky.

Anyone know how to deal with this?

(langchain (1.2.10), langchain-core (1.2.7), AzureChatOpenAI from langchain-openai (1.1.7))

The ainvoke of BaseChatModel literally returns the first message of the first generation. Maybe it makes sense to have an option to return the last message?

class BaseChatModel(BaseLanguageModel[AIMessage], ABC):
    ...
    async def ainvoke(
        ...
        return cast(
            "AIMessage", cast("ChatGeneration", llm_result.generations[0][0]).message
        )

Are you using `with_structured_output`? If yes, then use `strict=True`.
Doc Reference: Models - Docs by LangChain

I’m aware of this option, but one would expect more consistency across the current models and configurations by now. Nevertheless, multiple messages in a single generation (option) is really a weird one. Must be an API thing.

I’m intentionally not using with_structured_output because I want reasoning before the structured data, to increase performance/quality. gpt-4.1 is the most intelligent model, but not a reasoning model. Furthermore, the reasoning models only provide reasoning summaries, which is no good in my use case.

You can include a `reasoning` parameter in the Pydantic class that you are using. That always works well for me.

I used to do that too, but then the model has no chance to reflect anymore, since that field is already part of the final “answer”. What really improves quality on complex tasks is to let it reason on a “scratch pad” before writing the final answer. One could also prompt-chain it, but this way you can do it in one invocation.
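The scratch-pad pattern can be sketched without any framework; the `<scratchpad>` tag wording and the parsing helper below are my own assumptions, not a LangChain API:

```python
import json
import re

# Prompting pattern (assumed wording): free-form reasoning first,
# then the structured answer.
PROMPT_SUFFIX = (
    "Think step by step inside <scratchpad>...</scratchpad>, "
    "then output only the final JSON object."
)

def split_scratchpad(text: str) -> dict:
    """Strip the scratch-pad block and parse what remains as JSON."""
    answer = re.sub(r"<scratchpad>.*?</scratchpad>", "", text, flags=re.S).strip()
    return json.loads(answer)

reply = '<scratchpad>42 fits the schema.</scratchpad>\n{"answer": 42}'
assert split_scratchpad(reply) == {"answer": 42}
```

The model reasons freely in the tag, and only the trailing JSON is handed to the parser, so the reasoning never has to fit the schema.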

In any case, thanks for sparring @keenborder786


Hi @weipienlee

Yeah, I totally get the frustration here — this one feels more like an abstraction leak than something you’re doing wrong.

What’s really happening is:

GPT-4.1 sometimes “tries again” internally if it messes up the JSON (like truncating it). The generations API actually contains those retries. But ainvoke() only returns the first message of the first generation, so if that first one is broken, you never see the corrected version.

So it’s not really a Pydantic problem — it’s just how ainvoke() is currently implemented.

Since you’re intentionally not using with_structured_output(strict=True) (which makes sense if you want scratchpad-style reasoning before the final structured output), here are a couple of practical ways people handle this:

1. Add a small retry on parse failure
If JSON parsing fails, just re-prompt with something like:
“Return only valid JSON matching the schema. No extra text.”
It’s simple, but surprisingly reliable in production — and you still keep your reasoning-first approach.
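The retry loop is framework-agnostic; here is a minimal sketch with a stubbed model call (the function names and re-prompt wording are my assumptions, not LangChain's):

```python
import json

def parse_or_retry(call_model, prompt, max_retries=2):
    """Call the model; on invalid JSON, re-prompt with a stricter instruction."""
    for attempt in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Tighten the instruction and try again.
            prompt += "\nReturn only valid JSON matching the schema. No extra text."
    raise ValueError("model never returned valid JSON")

# Fake model: first reply is truncated, the retry succeeds.
replies = iter(['{"answer": "tru', '{"answer": 42}'])
result = parse_or_retry(lambda p: next(replies), "Answer as JSON.")
assert result == {"answer": 42}
```

In a real setup, `call_model` would wrap your chat model invocation and the parse step would be PydanticOutputParser instead of bare `json.loads`.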

2. Return the last generation instead of the first
If you’re comfortable subclassing, you can override ainvoke() to return:

llm_result.generations[0][-1].message

instead of `[0][0]`.
In cases where the model internally corrected itself, that usually gives you the fixed version without switching everything to agenerate().
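To see why the index matters, here is a toy illustration with stand-in objects (the `Gen` class is only an assumed shape mimicking ChatGeneration's `.message` attribute):

```python
from dataclasses import dataclass

# Minimal stand-in for ChatGeneration, just to show why
# generations[0][-1] can differ from generations[0][0].
@dataclass
class Gen:
    message: str

def last_message(generations):
    # generations[0] holds all candidates for the first prompt;
    # [-1] is the model's latest (possibly self-corrected) attempt.
    return generations[0][-1].message

gens = [[Gen('{"answer": "tru'), Gen('{"answer": 42}')]]
assert last_message(gens) == '{"answer": 42}'
```

When the list holds only one candidate, `[0][-1]` and `[0][0]` are the same element, so the override is backwards-compatible with the common case.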

3. Structured reasoning field (if you ever reconsider)
You can keep reasoning by explicitly including it in your Pydantic model:

class Output(BaseModel):
    reasoning: str
    answer: FinalSchema

Then instruct the model to think in reasoning and put the final structured output in answer.
That preserves reflection but still constrains the final output shape.


If I were optimizing for stability while keeping scratchpad reasoning, I’d probably combine:

  • returning the last generation

  • plus a lightweight retry on JSON parse failure

That tends to eliminate most truncation issues without making the architecture feel clunky.

@Bitcot_Kaushal, my experience is that you need much effort to engineer the reasoning field description to get a comparable quality of a “scratch pad” reasoning. It really makes difference, with the current OpenAI models at least.

I ended up overriding the chat model. You have to peel the last item from the content and create a new message instance, something like:

# inside the overridden ainvoke
org_msg = llm_result.generations[0][0].message  # type: ignore
if org_msg.content and isinstance(org_msg.content, list):
    # keep only the last content item (the corrected attempt)
    return AIMessage(
        content=[org_msg.content[-1]],
        additional_kwargs=org_msg.additional_kwargs,
    )

# fallback to base behavior
return await super().ainvoke(input, config=config, stop=stop, **kwargs)

Thanks for sparring, though.

Hi @weipienlee

That makes a lot of sense actually — and you’re absolutely right.

In practice, getting a “reasoning” field to behave like a true scratchpad does take real prompt engineering effort. With current OpenAI models especially, the quality difference between free-form internal reasoning vs. schema-constrained reasoning can be noticeable.

I like your approach of peeling the last content item and reconstructing a new AIMessage. It’s a pretty clean way to:

  • Preserve the internal retry behavior

  • Avoid switching to agenerate() everywhere

  • Keep scratchpad-style reasoning intact

  • Not force strict structured output

Your override pattern is pragmatic and minimal — which I honestly prefer over over-engineering the schema just to simulate reasoning.