Hello everyone,
I have been building a RAG agent using `create_agent()` and the `gpt-oss-120b` model hosted on Azure AI Foundry, and I am running into a very strange issue that I have been unable to solve after weeks of trying.
Basically, when the agent only needs a couple of tool calls to answer a question, everything works fine. However, on questions that require around 5+ tool calls, at some point the agent emits a raw tool call in the OpenAI Harmony response format as plain text. That text is then treated as the final answer, which ends the agent graph prematurely. What confuses me is that the earlier tool calls in the very same invocation were all parsed fine and the tools were executed successfully; the failure just appears seemingly at random after many tool calls within a single invocation.
Here is an example snippet of the issue:
================================ Human Message =================================
What updates have there been since v8.0?
================================== Ai Message ==================================
Tool Calls:
retrieve_from_manual (call_5b4617e1c3c34e69a510dbe9)
Call ID: call_5b4617e1c3c34e69a510dbe9
Args:
query: release notes 8.5
================================= Tool Message =================================
Name: retrieve_from_manual
Here are the relevant passages from the user manual (only use them if they are relevant to the user's question):
... (output truncated for simplicity)
================================== Ai Message ==================================
Tool Calls:
retrieve_from_manual (call_5b4617e1c3c34e69a510dbe9)
Call ID: call_5b4617e1c3c34e69a510dbe9
Args:
query: release notes 9.0
...
================================== Ai Message ==================================
<|start|>assistant<|channel|>analysis to=functions.retrieve_from_manual <|constrain|>json<|message|>{"query": "release notes 10.0"}<|call|>
I have looked online to try to find anyone else encountering this, but couldn't find anything describing the same behavior. I am assuming this is a LangChain issue rather than an issue with the model itself.
Here is my code. The original version was more complex (memory management, additional middleware, etc.), but I stripped it down to the basic elements of creating the agent to isolate the issue, and it still occurs:
Package versions:
langchain==1.1.3
langchain-azure-ai==1.0.4
langchain-classic==1.0.0
langchain-community==0.4.1
langchain-core==1.1.3
langchain-openai==1.1.1
langchain-text-splitters==1.0.0
langgraph==1.0.4
langgraph-checkpoint==2.1.1
langgraph-prebuilt==1.0.5
langgraph-sdk==0.2.9
qdrant-client==1.15.1
sentence-transformers==5.1.0
Code:
import os
import re
from pathlib import Path
from dotenv import load_dotenv
from dataclasses import dataclass
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.checkpoint.serde.jsonplus import JsonPlusSerializer
from langchain.tools import tool, ToolRuntime
from langchain_core.messages import HumanMessage
from langchain.agents import create_agent
from langchain.agents.middleware import dynamic_prompt, ModelRequest
from qdrant_client import QdrantClient
from fastembed import SparseTextEmbedding
from sentence_transformers import SentenceTransformer
from qdrant_client import models
from qdrant_client.hybrid.fusion import distribution_based_score_fusion
from langchain_azure_ai.chat_models import AzureAIChatCompletionsModel
# Loading environment variables
load_dotenv()
# Initialize LLM for generation
llm = AzureAIChatCompletionsModel(
endpoint=os.getenv("AZURE_LLM_ENDPOINT"),
credential=os.getenv("AZURE_LLM_KEY"),
model="gpt-oss-120b",
max_tokens=4096,
temperature=0.0,
verbose=False,
client_kwargs={"logging_enable": False},
model_kwargs={"reasoning_effort": "high"}, # Only keep this uncommented for gpt-oss-120b
)
# Initialize vector DB
qdrant_client = QdrantClient(path="data/qdrant_vectordb")
# Initialize dense and sparse embedding models
dense_embedding_model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
sparse_embedding_model = SparseTextEmbedding("Qdrant/bm25")
@dataclass
class Context:
user_name: str
language: str
version: str
# --------------------------------------------------------------------------------------------------------------------------
# Tool definitions
# --------------------------------------------------------------------------------------------------------------------------
@tool
def retrieve_from_manual(
query: str,
runtime: ToolRuntime[Context],
) -> str:
"""
Use this tool for searching the user manual to retrieve the passages relevant to the given query.
Args:
query: a short query string describing what you are searching for
Returns:
str: the relevant user manual passages
"""
if query == "":
# No query -> no retrieval
return "No query given."
# Dense retrieval
dense_results = qdrant_client.query_points(
collection_name="user_manual_collection",
query=dense_embedding_model.encode_query(query, normalize_embeddings=True),
using="dense",
limit=3000,
with_payload=True,
)
dense_results = dense_results.points
# Copy of IDs and scores, since `dense_results` gets mutated by dbsf; also convert the score
dense_ids_and_scores = [(doc.id, doc.score * 2 - 1) for doc in dense_results[:101]]
# Sparse retrieval
sparse_results = qdrant_client.query_points(
collection_name="user_manual_collection",
query=models.SparseVector(**next(sparse_embedding_model.embed(query)).as_object()),
using="sparse",
limit=3000,
with_payload=True,
)
sparse_results = sparse_results.points
dense_and_sparse_results = [dense_results, sparse_results]
# Hybrid search fusion
hybrid_dbsf_results = distribution_based_score_fusion(dense_and_sparse_results, limit=3)
# Store retrieved docs as (doc, dense_score)
retrieved_docs = []
for doc in hybrid_dbsf_results:
for dense_id, dense_score in dense_ids_and_scores:
if doc.id == dense_id:
retrieved_docs.append((doc, dense_score))
context = "Here are the relevant passages from the user manual (only use them if they are relevant to the user's question): \n\n"
for i, (doc, score) in enumerate(retrieved_docs):
title = doc.payload["metadata"].get("title", "Title N/A") # Get title for logging
url = doc.payload["metadata"].get("url", "URL N/A")
page_id = doc.payload["metadata"].get("page_id", "Page ID N/A")
is_full_page = doc.payload["metadata"].get("is_full_page", "is_full_page N/A")
if isinstance(url, str) and runtime.context.language != "ar":
url = url.replace("/en/", f"/{runtime.context.language}/", 1)
page_content = doc.payload["page_content"]
context += "-"*100 + "\n"
context += f"Passage {i+1}:\nPage Title: {title}\nPage URL: {url}\nPage ID: {page_id}\nis_full_page: {is_full_page}\nRelevance score: {score}\n"
context += page_content + "\n\n\n"
return context
def _parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
"""
Parse pseudo-YAML frontmatter delimited by lines that equal '-_-_-_-'.
Returns (metadata_dict, body_without_frontmatter).
"""
# Match:
# - start of file
# - delimiter line
# - any content (frontmatter)
# - delimiter line
# - rest is body
pat = re.compile(r"^\s*-_-_-_-\s*\n(.*?)\n\s*-_-_-_-\s*\n?", re.DOTALL)
m = pat.match(text)
if not m:
return {}, text # no frontmatter, return as-is
raw_meta = m.group(1)
body = text[m.end():]
meta: dict[str, str] = {}
for line in raw_meta.splitlines():
line = line.strip()
if not line or line.startswith("#"):
continue
# Basic "key: value" parse
if ":" not in line:
continue
k, v = line.split(":", 1)
k = k.strip()
v = v.strip()
# strip quotes if present
if len(v) >= 2 and ((v[0] == v[-1] == "'") or (v[0] == v[-1] == '"')):
v = v[1:-1].strip()
# normalize null-ish
if v.lower() in {"null", "none"}:
v = ""
meta[k] = v
return meta, body
@tool
def get_full_page(page_id: str, runtime: ToolRuntime[Context]) -> str:
"""
Use this tool to return the full content of a specific page given the page ID. This can be used if the
retrieved passages from the `retrieve_from_manual` tool contain a passage from a relevant page but are
missing additional relevant content typically found in a different passage in the same page.
Args:
page_id (str): the page ID of the relevant page.
Returns:
str: the full content of the specified page, or a helpful message if not found.
"""
markdown_dir = Path("data/processed_data")
max_body_chars = 125000 # safety character limit for very large files (applied to content body only)
# Sanitize page_id to prevent path traversal / weird chars
safe_id = re.sub(r"[^0-9A-Za-z_-]", "", str(page_id).strip())
if not safe_id:
return "No page_id provided (or it contained only invalid characters)."
try:
# 1) Exact prefix match: {page_id}_*.md
prefix_matches = sorted(
markdown_dir.rglob(f"{safe_id}_*.md"),
key=lambda p: len(p.name)
)
candidates = prefix_matches
# 2) Fallback: fuzzy match anywhere in the name
if not candidates:
fuzzy_matches = sorted(
markdown_dir.rglob(f"*{safe_id}*.md"),
key=lambda p: len(p.name)
)
candidates = fuzzy_matches
if not candidates:
return f"No Markdown file found for page_id '{safe_id}'."
# Prefer shortest filename on the assumption it's the canonical page file
target = candidates[0]
# Read the file contents (synchronous read; this tool is called synchronously)
raw_text = target.read_text(encoding="utf-8")
meta, body = _parse_frontmatter(raw_text)
# Truncate BODY only (keep header intact)
truncated = False
if len(body) > max_body_chars:
body = body[:max_body_chars] + "\n\n…(truncated)…"
truncated = True
# Assemble the final output format
header_lines = [
"Here is the full content of the page:",
"",
f"Page ID: {meta.get('page_id', safe_id)}",
f"Page URL: {meta.get('url', 'N/A')}",
f"Module: {meta.get('module', 'N/A')}",
"",
]
full_page = "\n".join(header_lines) + body
return full_page
except Exception as e:
return f"Failed to load page '{safe_id}': {e}"
# --------------------------------------------------------------------------------------------------------------------------
# Dynamic system prompt definition
# --------------------------------------------------------------------------------------------------------------------------
@dynamic_prompt
def dynamic_context_system_prompt(request: ModelRequest) -> str:
"""Create the Agent's system prompt using dynamic context via @dynamic_prompt wrapper"""
version = request.runtime.context.version
system_prompt = f"""
(*Big system prompt describing behavior and tool use*)
"""
return system_prompt
# --------------------------------------------------------------------------------------------------------------------------
# Compiling LangChain Agent
# --------------------------------------------------------------------------------------------------------------------------
def build_agent():
"""Compile the LangChain Agent and set the global `rag_agent`"""
rag_agent = create_agent(
model=llm,
tools=[
retrieve_from_manual,
get_full_page,
],
middleware=[
dynamic_context_system_prompt,
],
context_schema=Context,
checkpointer=InMemorySaver(serde=JsonPlusSerializer(pickle_fallback=True)),
)
return rag_agent
if __name__ == "__main__":
rag_agent = build_agent()
# Runtime context for the agent
context = {
"user_name": "Adam",
"language": "en",
"version": "11.0",
}
thread_id = 123
print("\nStarting interactive demo. Type 'quit' to exit.\n")
while True:
user_input = input("\nUser: ").strip()
if user_input.lower() in ("quit", "exit"):
print("Exiting demo.")
break
user_msg = HumanMessage(content=user_input)
state = {"messages": [user_msg]}
print("\n--- STREAM OUTPUT ---")
prev_len = 0
for step in rag_agent.stream(input=state, config={"configurable": {"thread_id": thread_id}, "recursion_limit": 75}, context=context, stream_mode="values"):
### Values stream_mode ###
messages = step["messages"]
# Only print when a new message was appended
if len(messages) <= prev_len:
continue
prev_len = len(messages)
last_message = messages[-1]
last_message.pretty_print()
I don’t think this is a terminal-printing issue, because I also have a version integrated into a FastAPI backend with a TypeScript frontend, and the same issue happens there.
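As a stopgap while debugging, I've been considering a simple guard in my client code (plain Python, no middleware API involved; names are my own) that flags a "final answer" that is actually a leaked Harmony call before showing it to the user:

```python
# Hypothetical stopgap (my own naming): a leaked Harmony tool call contains
# these special tokens, which should never appear in a genuine final answer.
def looks_like_leaked_harmony_call(text: str) -> bool:
    return "<|channel|>" in text and "<|call|>" in text
```

If the last AI message has no `tool_calls` but passes this check, I could log it and retry the invocation instead of returning it, but that obviously only masks the underlying problem.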
If anyone is able to help, that would be much appreciated!