How can I create a tool that will allow an agent to read and understand an image? What should the tool return so that the agent can read the image, or any other binary format? Note that I don’t want to make a tool that describes the image; instead, I want to create a tool that will allow agents to read the image/images.
Hi @hottered10! Correct me if I’m wrong, but you are trying to create a tool that takes an image as input (by URL or file path) and returns some data for the agent to comprehend.
Usually, you can call a multimodal model by passing a message that specifies the file format and the source type (check the Multimodality guide). By appending multiple messages you can not only describe the image textually but also provide a more structured output for the main agent to digest.
Inside your tool you’ll still need a way to process that image (using an LLM or a more traditional method).
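For example, a minimal sketch of that pattern (the model name and image URL here are just placeholders):

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini")

# Pass the image as a content block alongside a text instruction.
message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the contents of this image."},
        {
            "type": "image",
            "source_type": "url",
            "url": "https://example.com/some-image.jpg",  # placeholder URL
        },
    ],
)
response = llm.invoke([message])
print(response.content)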
Let me know if this helped!
I came across a similar problem, and I only found tools that describe images by returning a string. I want my agent to be able to recognize when an image contains a table and understand its contents, or, if for example it’s an image of nature, to recognize the objects in it, not just receive a descriptive string from a tool.
Hey @rako and @hottered10 you could do something similar to the following:
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.tools import tool
class ImageInfo(BaseModel):
    description: str = Field(description="A description of the image")
    table: bool = Field(description="Whether there is a table in the image")
    table_content: str = Field(description="The content of the table if there is one")


llm = ChatOpenAI(model="gpt-4o-mini")
structured_llm = llm.with_structured_output(ImageInfo)


@tool
def extract_image_info(image_url: str) -> dict:
    """
    Extracts structured information from an image given the following output schema:
    - description: A description of the image
    - table: Whether there is a table in the image
    - table_content: The content of the table if there is one

    Args:
        image_url: The URL of the image to analyze

    Returns:
        A dictionary containing the description, table, and table content
    """
    response = structured_llm.invoke([
        SystemMessage(content="You are a helpful assistant that extracts structured information from an image."),
        HumanMessage(
            content=[
                {"type": "text", "text": "Analyze the image"},
                {
                    "type": "image",
                    "source_type": "url",
                    "url": image_url,
                },
            ],
        ),
    ])
    return response.model_dump()
This is a possible solution to your problem; beyond that, it’s up to the model’s ability to comprehend tables in images. Inside the tool you could perform more complex operations, like splitting the image into multiple chunks, analyzing each one, and merging the results together (see the sketch below). You could also use a computer vision model to detect tables/objects in images, etc.
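If you want to try the chunking route, here is a rough, untested sketch (assuming Pillow is installed and reusing the structured_llm defined above) that splits the image into quadrants, analyzes each chunk, and collects the results:

import base64
import io

from PIL import Image
from langchain_core.messages import HumanMessage, SystemMessage


def analyze_in_chunks(image_path: str) -> list[dict]:
    # Split the image into four quadrants; a real implementation might tile more finely.
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    results = []
    for left, upper in [(0, 0), (width // 2, 0), (0, height // 2), (width // 2, height // 2)]:
        chunk = image.crop((left, upper, left + width // 2, upper + height // 2))
        buffer = io.BytesIO()
        chunk.save(buffer, format="JPEG")
        chunk_b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
        # Reuse the structured LLM from the snippet above on each chunk.
        response = structured_llm.invoke([
            SystemMessage(content="You are a helpful assistant that extracts structured information from an image."),
            HumanMessage(
                content=[
                    {"type": "text", "text": "Analyze this portion of the image"},
                    {
                        "type": "image",
                        "source_type": "base64",
                        "data": chunk_b64,
                        "mime_type": "image/jpeg",
                    },
                ],
            ),
        ])
        results.append(response.model_dump())
    return results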
You can generally create ToolMessages containing an image content block, e.g.,
ToolMessage(
    content=[
        {
            "type": "image",
            "source_type": "base64",
            "data": image_data,
            "mime_type": "image/jpeg",
        },
    ],
    tool_call_id="...",
)
The above format is LangChain’s cross-provider standard. Providers also support the OpenAI format:
ToolMessage(
    content=[
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
    tool_call_id="...",
)
Note that not all LLM providers support this functionality. I believe Anthropic and Gemini both do.
This is admittedly not documented well, but is described in the docs for a standard test we run on select models supporting this feature.
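For reference, a minimal (untested) sketch of wiring this up end-to-end with a provider that supports it, using Anthropic as an example; the tool, tool_call_id, and file path are placeholders:

import base64

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage
from langchain_core.tools import tool


@tool
def random_image() -> str:
    """Return an image."""
    return ""


llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")

with open("image.jpeg", "rb") as f:  # placeholder path
    image_data = base64.b64encode(f.read()).decode("utf-8")

messages = [
    HumanMessage("Call the tool, then describe the image it returns."),
    AIMessage(
        "",
        tool_calls=[{"type": "tool_call", "id": "1", "name": "random_image", "args": {}}],
    ),
    # Return the image to the model as a tool result, using the cross-provider block format.
    ToolMessage(
        content=[
            {
                "type": "image",
                "source_type": "base64",
                "data": image_data,
                "mime_type": "image/jpeg",
            },
        ],
        tool_call_id="1",
    ),
]

response = llm.bind_tools([random_image]).invoke(messages)
print(response.content)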
Hello @marco and @chester-lc! Thank you for the help. I’ll try the suggestions you sent, and if you think of anything else, please send it my way.
@chester-lc Thanks for the help. I tried the test example from LangChain that you referenced. Here is the code:
import base64

from langchain_core.messages import HumanMessage, ToolMessage, AIMessage
from langchain_openai import ChatOpenAI


def random_image() -> str:
    """Return an image."""
    return ""


llm = ChatOpenAI(
    model="gpt-4.1-2025-04-14",  # make sure your API key supports this
    temperature=0,
)

image_path = "/home/bogdan/Desktop/banana.jpeg"
with open(image_path, "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

tool_message = ToolMessage(
    content=[
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
    tool_call_id="1",
    name="random_image",
)

messages = [
    HumanMessage(
        content=[{"type": "text", "text": "Can you load the image and describe it?"}]
    ),
    AIMessage(
        [],
        tool_calls=[
            {
                "type": "tool_call",
                "id": "1",
                "name": "random_image",
                "args": {},
            }
        ],
    ),
    tool_message,
]

response = llm.bind_tools([random_image]).invoke(messages)
print(response)
But I got this AIMessage as a response:
response = AIMessage(content='It appears that no image has been uploaded or provided yet. Please upload an image or provide a link, and I’ll be happy to describe it for you!',...)
I also tried calling a tool that returns this type of dictionary:
{
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
}
import base64

from langchain.agents import initialize_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI


@tool
def image_tool(input: str) -> dict:
    """This is a tool for loading an image"""
    file_path = "/home/bogdan/Desktop/banana.jpeg"
    with open(file_path, "rb") as f:
        image_data = f.read()
    encoded_image = base64.b64encode(image_data).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
    }


llm = ChatOpenAI(model="gpt-4o", temperature=0)

agent = initialize_agent(
    tools=[image_tool],
    llm=llm,
    agent="zero-shot-react-description",
    verbose=True,
    handle_parsing_errors=True,
)

response = agent.invoke(
    {"input": "Can you load the image and tell me what is on the picture?"}
)
print(response)
But I got this response:
Thought:The image has been successfully loaded, and it appears to be a complex scene with various elements. However, I cannot directly interpret or describe the contents of the image from the base64 data provided. You may need to use an image viewer to see the image or provide a description of what you see.
Observation: Invalid Format: Missing 'Action:' after 'Thought:
Thought:I cannot directly interpret or describe the contents of the image from the base64 data provided. You may need to use an image viewer to see the image or provide a description of what you see.
Do you know of any other solution where the agent will know that the returned value contains an image and will know how to process it? Thanks again.
OpenAI does not support images in tool responses. Refer to their API reference:
The same appears true for their Responses API. One option is to instead use structured outputs to generate your tool payload, and then return the response as a user message instead of a tool message:
import base64

from langchain.chat_models import init_chat_model
import httpx
from pydantic import BaseModel

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")


class GetImage(BaseModel):
    """Generate a query to fetch an image."""

    query: str


llm = init_chat_model("openai:gpt-4.1")

input_message = {"role": "user", "content": "Retrieve an image of a boardwalk."}
response = llm.invoke([input_message], response_format=GetImage)

parsed = response.additional_kwargs["parsed"]
assert isinstance(parsed, GetImage)
print(f"Fetching image: {parsed.query}")

image_message = {
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        }
    ],
}

llm.invoke(
    [input_message, response, image_message]
)