ChatHuggingFace + HuggingFacePipeline never parses tool calls

Hello Forum!

When using ChatHuggingFace + HuggingFacePipeline, the returned AIMessage never populates AIMessage.tool_calls (it is always empty), even if the model output contains well-formatted JSON for a tool call.

Expected outcome:
AIMessage.tool_calls containing a list of ToolCall objects (as with OpenAI’s chat model) whenever the LLM response contains a JSON object describing a tool call.

Diving into the source code, I can see that this is expected: there is simply no code for parsing tool calls out of the results.

When the llm is an instance of HuggingFacePipeline, the ChatHuggingFace._generate(…) method calls:

1. self._to_chat_prompt(..)

2. llm._generate(…)

3. self._to_chat_result(…)

The llm._generate method returns an LLMResult (which has no field for tool_calls), so my guess is that the ChatHuggingFace model should be parsing the tool calls and creating the appropriate ToolCall objects.
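For context, AIMessage.tool_calls is just a list of ToolCall entries; in langchain_core, ToolCall is a TypedDict with name/args/id/type keys, so plain dicts of this shape are what the parsing step would have to produce:

```python
# Shape of a populated AIMessage.tool_calls entry. ToolCall is a TypedDict
# in langchain_core.messages, so a plain dict with these keys is sufficient.
tool_call = {
    "name": "multiply",        # name of the tool to invoke
    "args": {"a": 3, "b": 4},  # arguments parsed from the model's JSON
    "id": "call_abc123",       # unique id for this call
    "type": "tool_call",
}
tool_calls = [tool_call]
```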

I have some bandwidth to work on this, and it would be great if we could align on possible solutions and how to implement them. If you think this change is interesting enough to be incorporated, I can think through possible implementations and post them here for a more focused/precise discussion.

Looking forward to hearing your thoughts!!

I’m using the following versions:

  • langchain=1.0.3

  • langchain_core=1.0.3

  • langchain-huggingface=1.0.1

I’m posting an example code to reproduce the issue, just in case I am doing something wrong.

import torch

from langchain_core.messages import SystemMessage, HumanMessage
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_huggingface import ChatHuggingFace
from langchain.tools import tool



@tool
def multiply(a: int, b: int) -> int:
    """Multiply a and b.

    Args:
        a: first int
        b: second int
    """
    return a * b

model_name = 'Qwen/Qwen3-4B-Thinking-2507'

llm = HuggingFacePipeline.from_model_id(
    model_id=model_name,
    task='text-generation',
    device=0,
    batch_size=1,
    model_kwargs={
        'temperature': 0.1,
        'max_length': 8192,
        'torch_dtype': torch.float16,
    },
)

sys_msg = SystemMessage(content="You are a helpful assistant tasked with performing arithmetic on a set of inputs.")

prompt='''
multiply 3 by 4. 
You must use a tool name multiply which receives as parameters the two numbers to be multiplied
Respond only using a JSON blob with the following format:
{
  "name": "multiply",
  "args": { "a": "3", "b": "4" },
  "id": "multiply_call",
  "type": "tool_call"
}
'''
human_msg = HumanMessage(content=prompt)


chat_model = ChatHuggingFace(llm=llm)
llm_with_tools = chat_model.bind_tools([multiply])

llm_output = llm_with_tools.invoke([sys_msg, human_msg])

print(llm_output.tool_calls)

Diego

Hi @diegomarron

It seems like this is an expected limitation when ChatHuggingFace wraps HuggingFacePipeline. It’s not an issue in your code.
With the pipeline backend, ChatHuggingFace converts the raw generated text into an AIMessage without attempting to parse tool calls, so AIMessage.tool_calls stays empty.
Tool-call parsing is only implemented for the structured chat backends such as HuggingFaceEndpoint (Hugging Face TGI/Inference Endpoints), not for raw pipelines.

If you must stay on pipelines, post-process the model’s JSON output yourself and map it to ToolCall objects (parse the JSON and set AIMessage.tool_calls), or use a structured-output parser such as JsonOutputParser/JsonOutputKeyToolsParser to extract the call and then invoke your tool.
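A minimal sketch of that post-processing, in pure Python with no LangChain imports (the function name is mine): scan the raw generation for JSON objects and convert any tool-call-shaped blob into ToolCall-style dicts that you can assign to AIMessage.tool_calls. Using json.JSONDecoder.raw_decode handles nested objects correctly, which a simple regex would not.

```python
import json
import uuid

def extract_tool_calls(text: str) -> list[dict]:
    """Scan raw model output for JSON objects and return any that look like
    tool calls as ToolCall-shaped dicts (name/args/id/type)."""
    decoder = json.JSONDecoder()
    calls = []
    idx = text.find("{")
    while idx != -1:
        try:
            # raw_decode parses one complete JSON value starting at idx,
            # returning the object and the index just past its end.
            blob, end = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            idx = text.find("{", idx + 1)
            continue
        if isinstance(blob, dict) and "name" in blob and "args" in blob:
            calls.append({
                "name": blob["name"],
                "args": blob["args"],
                "id": blob.get("id") or f"call_{uuid.uuid4().hex[:8]}",
                "type": "tool_call",
            })
        idx = text.find("{", end)
    return calls
```

You could then build the message manually, e.g. `AIMessage(content="", tool_calls=extract_tool_calls(raw_text))`.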

Some examples:

from langchain_huggingface.llms import HuggingFaceEndpoint
from langchain_huggingface import ChatHuggingFace
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.tools import tool

@tool
def multiply(a: int, b: int) -> int:
    return a * b

# Requires HUGGINGFACEHUB_API_TOKEN env var or pass huggingfacehub_api_token=...
llm = HuggingFaceEndpoint(
    repo_id="microsoft/Phi-3-mini-4k-instruct",  # or your endpoint/model
    max_new_tokens=64,
    do_sample=False,
)

chat = ChatHuggingFace(llm=llm)
chat_with_tools = chat.bind_tools([multiply])

msgs = [
    SystemMessage(content="You can call tools when needed."),
    HumanMessage(content="Multiply 3 by 4 using the tool."),
]
ai = chat_with_tools.invoke(msgs)
print(ai.tool_calls)
Second example, using a self-hosted TGI server via langchain_community:

from langchain_community.llms.huggingface_text_gen_inference import (
    HuggingFaceTextGenInference,
)
from langchain_huggingface import ChatHuggingFace
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.tools import tool

@tool
def multiply(a: int, b: int) -> int:
    return a * b

# TGI server must be running (e.g., http://localhost:8080)
llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8080",
    max_new_tokens=64,
    temperature=0.0,
)

chat = ChatHuggingFace(llm=llm)
chat_with_tools = chat.bind_tools([multiply])

msgs = [
    SystemMessage(content="You can call tools when needed."),
    HumanMessage(content="Multiply 3 by 4 using the tool."),
]
ai = chat_with_tools.invoke(msgs)
print(ai.tool_calls)

Hi @pawel-twardziak
Thank you so much for your pointer. Now I have a better picture :slight_smile:
Sadly, due to legal constraints, I must stick to offline models that run locally.

I would like to politely raise an observation regarding the current implementation of ChatHuggingFace + HuggingFacePipeline.

While I can certainly work around the current state, including parsing the model’s output and manually injecting the necessary tool schemas directly into the prompt, doing so is a significant inconvenience: it forces me to re-implement fundamental logic that LangChain is supposed to simplify.

I would expect the ChatHuggingFace + HuggingFacePipeline combination to work like other chat models:

  • .bind_tools(…)

  • automatically pass the tools schema and output generation constraints to the prompt

  • fill the tool_calls for me
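To illustrate the second bullet, here is a rough sketch of what bind_tools would have to do for the pipeline backend: render each tool’s schema into the system prompt. This is pure Python; the function name and prompt wording are mine, and in LangChain the OpenAI-style schemas could come from something like convert_to_openai_tool.

```python
import json

def render_tools_prompt(tools: list[dict]) -> str:
    """Render OpenAI-style tool schemas ({'name', 'description', 'parameters'}
    dicts) into a system-prompt block asking for a tool-call JSON blob."""
    lines = [
        "You may call a tool by replying ONLY with a JSON blob of the form:",
        '{"name": <tool name>, "args": {<arguments>}, "id": <call id>, "type": "tool_call"}',
        "Available tools:",
    ]
    for t in tools:
        lines.append(
            f"- {t['name']}: {t['description']} "
            f"(parameters: {json.dumps(t['parameters'])})"
        )
    return "\n".join(lines)
```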

In addition, the current implementation of ChatHuggingFace also requires message preprocessing before llm.invoke(…), since:

  • The _to_chat_prompt(..) method supports neither ToolCall nor ToolMessage objects

  • The _to_chatml_prompt(..) method requires the last message to be a HumanMessage, which is not always the case for a ReAct agent (control may come back from a ToolNode)
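A sketch of the kind of preprocessing those two points currently force on the caller, using plain role/content dicts in place of LangChain message objects (the folding strategy shown is one option among several):

```python
def ensure_human_last(messages: list[dict]) -> list[dict]:
    """Fold trailing tool-result messages into a final 'user' turn so the
    conversation ends with a human message, as _to_chat_prompt expects.
    Messages are {'role': ..., 'content': ...} dicts standing in for
    LangChain's HumanMessage/AIMessage/ToolMessage objects."""
    out = [m for m in messages if m["role"] != "tool"]
    tool_results = [m["content"] for m in messages if m["role"] == "tool"]
    if tool_results:
        # Re-inject tool outputs as a closing human turn.
        summary = "\n".join(f"Tool result: {r}" for r in tool_results)
        out.append({"role": "user", "content": summary})
    return out
```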

Streamlining this integration would dramatically enhance developer efficiency, uphold the core promise of the LangChain framework, and unlock more seamless integration of all supported model providers. I want to emphasize again that I have the bandwidth to work on this with you.

Thank you very much for your time :slight_smile:

Diego