Model-agnostic multimodal LLM call

Hi all,

I’m experimenting with LangGraph for a simple prototype where the graph has just one node that calls an LLM. The wrinkle is that the LLM call is not a trivial text prompt:

  • The input includes a PDF file that must be uploaded (and referenced) in the model call.

Right now I’m handling this by calling the OpenAI Responses API directly inside the node. This works fine, but it makes my graph tightly coupled to OpenAI. I would like to make the workflow model-agnostic so that later I can swap in Anthropic, Gemini, or other providers without rewriting the graph logic.

My questions:

  1. Does LangGraph provide any built-in functions or abstractions to handle multimodal message objects (text + image/file), or is the expectation that developers wrap each provider’s SDK/API in their own adapter nodes?

  2. If the latter, is the recommended pattern to:

    • Define a neutral internal schema for messages (e.g., {type: "text" | "image" | "file", content: ...})

    • Then write per-provider adapters that map this neutral schema to the provider’s required structure? (There’s a rough sketch of what I mean after this list.)

  3. Are there any examples of LangGraph projects where multimodal input is handled in a provider-agnostic way, so that graph nodes remain portable across OpenAI, Anthropic, Gemini, etc.?
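To make question 2 concrete, this is roughly the kind of adapter layer I have in mind. Everything here is illustrative: the `Part` schema and `to_openai_responses` are my own invented names, only the OpenAI mapping is sketched, and the Responses API field names are written from memory and may need checking.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical neutral message part -- not a LangGraph or LangChain type.
@dataclass
class Part:
    type: Literal["text", "image", "file"]
    content: str                 # plain text, or base64 data for image/file
    mime_type: Optional[str] = None

def to_openai_responses(parts: list[Part]) -> list[dict]:
    """Map the neutral schema to OpenAI Responses API input items (illustrative)."""
    items = []
    for p in parts:
        if p.type == "text":
            items.append({"type": "input_text", "text": p.content})
        elif p.type == "image":
            items.append({
                "type": "input_image",
                "image_url": f"data:{p.mime_type};base64,{p.content}",
            })
        elif p.type == "file":
            items.append({
                "type": "input_file",
                "filename": "document.pdf",
                "file_data": f"data:{p.mime_type};base64,{p.content}",
            })
    return items

# A second adapter (to_anthropic, to_gemini, ...) would map the same Parts
# to that provider's structure, so the graph node itself stays unchanged.
```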

Thanks a lot for any guidance!

Hello, thanks for this question.

If you’re using LangChain chat models inside your LangGraph nodes, LangChain will take care of this for you: chat models accept a provider-agnostic multimodal content-block format and translate it to each provider’s native API. See this guide.
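Roughly, the pattern looks like this. This is a minimal sketch, not a drop-in solution: the model string and file path are placeholders, and the exact content-block keys are the ones documented in the guide.

```python
import base64

from langchain.chat_models import init_chat_model

# Placeholder model; any chat model that supports PDF input can be swapped in
# (e.g. an Anthropic or Gemini model) without changing the graph node.
llm = init_chat_model("openai:gpt-4o")

with open("report.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize this document."},
        {
            # Provider-agnostic file content block; LangChain converts it to
            # the provider's native format under the hood.
            "type": "file",
            "source_type": "base64",
            "data": pdf_b64,
            "mime_type": "application/pdf",
        },
    ],
}

response = llm.invoke([message])
print(response.content)
```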

Note that a minor change to this format is planned for the upcoming v1.0 release (although the v0.3.x format will continue to work in 1.0). You can reference the associated v1.0 docs here. 1.0 will be released this month, and alpha releases are available now.

Thanks!

I also realized that when I upload the PDF, only its text layer appears to be considered; the visual content of the pages is not. Do you have more info about this? Do I need to render the PDF pages to images and provide both the PDF and the images?
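To clarify what I mean by that last part, here is roughly what I would try. It's just a sketch under my own assumptions: I'm using pdf2image (which needs Poppler installed) purely as an example rasterizer, and the image block keys follow the guide linked above.

```python
import base64
import io

from pdf2image import convert_from_path  # requires Poppler to be installed

# Rasterize each PDF page so the model also sees the visual layout,
# not just the extracted text layer.
pages = convert_from_path("report.pdf", dpi=150)

image_blocks = []
for page in pages:
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    image_blocks.append({
        "type": "image",
        "source_type": "base64",
        "data": base64.b64encode(buf.getvalue()).decode(),
        "mime_type": "image/png",
    })

# These image blocks would go into the same message content list,
# alongside (or instead of) the PDF file block.
```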