With too many fields, how should DeepAgents handle this properly?

I have a question about a data requirement for an interface.
Users may ask for general information about Huawei Technologies Co., Ltd.; this is an imprecise query.
But later they might ask for the website of Huawei Technologies Co., Ltd. That is a precise field, and it could be a list: the user may ask for multiple precise fields at once.
There is a third possibility: asking for both basic information about Huawei and specific fields, which means both interfaces need to be called.
My current thinking is that the basic information interface is fine as it is. For the precise-field interface, however, the parameter should be a list, and we iterate through the list to fetch each field's result.
But there is a problem when the user asks for too many fields. I could write another function that first compares the user's requested fields against the built-in fields (this requires semantic comparison), and then queries and returns results only for the fields that actually exist.
The point of this step is to prevent errors caused by passing user-provided parameters straight through to the model.
With the raw OpenAI SDK, I used to assemble the context myself and could arrange the results however I wished. Now that I'm using DeepAgents, I don't know how to achieve the same effect.
I hope someone can help me. Perhaps my thinking is off, or maybe DeepAgents already has a good approach for this. I am still a novice with DeepAgents.
:confounded_face:

I didn't include field information or related descriptions in the tool metadata; I rely on keyword matching for execution, and the tool metadata merely guides users toward relevant search terms. As a result, my token count has decreased and the accuracy is decent. Still, I suspect there might be a better approach?

Hello @yech
Great question, and your instincts are mostly right. Let me break this down clearly for you.


The Core Mental Shift: Tools Are Your Interface

When you were using the raw OpenAI SDK, you were assembling context manually; you controlled what went into the message list. With DeepAgents (which sits on top of LangGraph/LangChain), that control moves into tools. The agent calls your tools, and whatever you return from a tool becomes part of the context automatically. So instead of thinking "how do I arrange the context?", think "how do I design my tools to return exactly the right data?".


Designing Your Two Tools

You’ve correctly identified that you need two separate tools. Here’s how to think about them:

Tool 1 — General/Fuzzy Company Info

This takes a free-form query and returns narrative information. No strict field validation needed.

from langchain_core.tools import tool

@tool
def get_company_overview(query: str) -> str:
    """Return general background information about Huawei Technologies Co., Ltd.

    Use this when the user asks broad, open-ended questions about the company
    (history, products, size, culture, etc.).

    Args:
        query: The user's free-form question about the company.
    """
    # Call your data source / RAG pipeline / API here
    return fetch_narrative_info(query)

Tool 2 — Precise Structured Fields (with validation)

This is where your design question lives. The parameter is a list[str] and the tool itself handles validation before doing anything expensive.

from langchain_core.tools import tool
from pydantic import BaseModel, field_validator

# These are your canonical field names — the source of truth
KNOWN_FIELDS: dict[str, str] = {
    "website":        "Official website URL",
    "headquarters":   "Headquarters location",
    "founded":        "Year founded",
    "ceo":            "Current CEO name",
    "employees":      "Approximate number of employees",
    "revenue":        "Latest annual revenue",
    "stock_ticker":   "Stock exchange and ticker symbol",
    "phone":          "Main contact phone number",
}

class CompanyFieldsInput(BaseModel):
    fields: list[str]

    @field_validator("fields")
    @classmethod
    def normalize_fields(cls, raw_fields: list[str]) -> list[str]:
        resolved = []
        for f in raw_fields:
            match = resolve_field(f)   # your semantic matching function
            if match:
                resolved.append(match)
        return resolved

@tool(args_schema=CompanyFieldsInput)
def get_company_fields(fields: list[str]) -> dict[str, str]:
    """Fetch specific, structured data fields for Huawei Technologies Co., Ltd.

    Use this when the user asks for one or more specific pieces of data
    (e.g., website, CEO, revenue). Returns only the fields that exist.

    Args:
        fields: List of field names the user wants (e.g., ["website", "ceo"]).
    """
    results = {}
    for field in fields:
        results[field] = fetch_field_from_source(field)
    return results

The Semantic Field Validation — Your Key Concern

This is the part where you prevent the model from hallucinating field names and passing garbage into your data layer. Here’s a clean pattern using LangChain embeddings:

from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.documents import Document

# Build a small in-memory vector store of your known fields at startup
_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
_field_store = InMemoryVectorStore.from_documents(
    documents=[
        Document(page_content=f"{name}: {desc}", metadata={"field": name})
        for name, desc in KNOWN_FIELDS.items()
    ],
    embedding=_embeddings,
)

def resolve_field(user_field: str, threshold: float = 0.75) -> str | None:
    """Map a user-supplied field name to a canonical field name.

    Returns the canonical name if a confident match is found, else None.

    Args:
        user_field: Raw field name as provided by the user or LLM.
        threshold: Minimum cosine similarity score to accept a match.
    """
    results = _field_store.similarity_search_with_score(user_field, k=1)
    if not results:
        return None
    doc, score = results[0]
    if score >= threshold:
        return doc.metadata["field"]
    return None

If you don’t want to use embeddings (e.g., for lower latency), a lightweight alternative is rapidfuzz:

from rapidfuzz import process, fuzz

def resolve_field(user_field: str, threshold: int = 70) -> str | None:
    match, score, _ = process.extractOne(
        user_field,
        KNOWN_FIELDS.keys(),
        scorer=fuzz.WRatio
    )
    return match if score >= threshold else None

Putting It All Together With DeepAgents

from deepagents import create_deep_agent

agent = create_deep_agent(
    model="openai:gpt-4o",
    tools=[get_company_overview, get_company_fields],
    system_prompt=(
        "You are a Huawei Technologies research assistant. "
        "For broad questions use get_company_overview. "
        "For specific data points (website, CEO, revenue, etc.) use get_company_fields. "
        "You may call both in the same response if the user wants both."
    ),
)

result = agent.invoke({
    "messages": [{
        "role": "user",
        "content": "Tell me about Huawei, and also give me their website and CEO."
    }]
})

The agent will naturally decide to call both tools when the user asks for both general info and specific fields, because that’s exactly what the tool descriptions guide it to do.


Addressing Your “Context Assembly” Concern Directly

You mentioned you used to arrange results yourself with the raw OpenAI SDK. In DeepAgents, you get equivalent control through two mechanisms:

  1. Tool return values — Whatever your tool returns (string, dict, list) is serialized into a ToolMessage and injected into the conversation context. You control the shape of that data entirely inside the tool function.

  2. response_format for structured final output — If you need the agent’s final answer to follow a strict schema (not just intermediate tool results), use the response_format parameter:

from pydantic import BaseModel
from langchain.agents.structured_output import ResponseFormat

class HuaweiReport(BaseModel):
    summary: str
    fields: dict[str, str]
    missing_fields: list[str]

agent = create_deep_agent(
    model="openai:gpt-4o",
    tools=[get_company_overview, get_company_fields],
    response_format=ResponseFormat(schema=HuaweiReport),
)

Summary

| Concern | Solution in DeepAgents |
| --- | --- |
| Two separate interfaces | Two @tool functions with clear descriptions |
| List of fields as input | args_schema=CompanyFieldsInput with Pydantic |
| Validate fields before querying | resolve_field() inside the tool (embeddings or fuzzy match) |
| Control what goes into context | Control what the tool returns |
| Structured final output | response_format=ResponseFormat(schema=YourModel) |
| User asks for both | Agent calls both tools naturally based on tool descriptions |

Your original thinking was sound. The key realization is that the tool function is the right place to put your validation and normalization logic, not a separate pre-processing step outside the agent. This keeps the agent's interface clean while protecting your data layer from bad inputs.

Thank you for your reply,
I am currently using keyword matching.
I have a filter dictionary used to match user input with keywords:

FIELD_INDEX: dict[str, list[str]] = {
    'user_input1': ['field1'],
    # ... and so on
    'user_input255': ['field255'],
}

def match_fields(query: str) -> list[str]:
    """Keyword matching: extract corresponding API field names from user queries."""
    matched = []
    seen = set()
    for keyword, fields in FIELD_INDEX.items():
        if keyword in query:
            for f in fields:
                if f not in seen:
                    seen.add(f)
                    matched.append(f)
    return matched
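To make the matching behavior concrete, here is the same logic exercised on a toy index (the keywords and field names below are made up for illustration):

```python
# Hypothetical keyword-to-field index, mirroring FIELD_INDEX above
FIELD_INDEX = {
    "website": ["official_website"],
    "ceo": ["legal_representative"],
    "capital": ["registered_capital", "paid_in_capital"],
}

def match_fields(query: str) -> list[str]:
    # Substring match each keyword against the query, deduplicating fields
    matched, seen = [], set()
    for keyword, fields in FIELD_INDEX.items():
        if keyword in query:
            for f in fields:
                if f not in seen:
                    seen.add(f)
                    matched.append(f)
    return matched

print(match_fields("what is the ceo and registered capital?"))
# → ['legal_representative', 'registered_capital', 'paid_in_capital']
```

Note that plain substring matching is case- and phrasing-sensitive; a query that says "chief executive" instead of "ceo" matches nothing, which is exactly the gap semantic or LLM-based matching closes.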

async def query_func1(company_name: str) -> str:
    """This is the first interface, which is insignificant"""
    pass

async def query_field(company_name: str, query: str) -> str:
    """Second interface — keyword-based field matching."""
    fields = match_fields(query)
    if not fields:
        return f"No fields related to '{query}' were matched. Please try a more specific description."
    
    result = await api.kyc(company_name)
    data = result["result"]["Data"]
    lines = []
    for field in fields:
        value = deep_get(data, field)
        if value == "unknown":
            lines.append(f"\n【{field}】No relevant data found")
        else:
            lines.append(f"\n【{field}】")
            lines.append(format_value(value))
    return "\n".join(lines)

def deep_get(data: dict, key: str, default="unknown"):
    """Recursively search for a value in a nested dictionary by field name, regardless of how many levels deep it is."""
    if key in data:
        return data[key]

    for v in data.values():
        if isinstance(v, dict):
            result = deep_get(v, key, default)
            if result != default:
                return result
        elif isinstance(v, list):
            for item in v:
                if isinstance(item, dict):
                    result = deep_get(item, key, default)
                    if result != default:
                        return result
    return default
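To sanity-check deep_get, here is the same function run against toy nested data (the company data below is made up):

```python
def deep_get(data: dict, key: str, default="unknown"):
    # Depth-first search for `key` anywhere in a nested dict/list structure
    if key in data:
        return data[key]
    for v in data.values():
        if isinstance(v, dict):
            result = deep_get(v, key, default)
            if result != default:
                return result
        elif isinstance(v, list):
            for item in v:
                if isinstance(item, dict):
                    result = deep_get(item, key, default)
                    if result != default:
                        return result
    return default

data = {"result": {"Data": {"basic": {"website": "https://example.com"},
                            "people": [{"role": "ceo", "name": "Ren"}]}}}
print(deep_get(data, "website"))  # → https://example.com
print(deep_get(data, "name"))     # → Ren
print(deep_get(data, "missing"))  # → unknown
```

One caveat with this sentinel pattern: a real value that happens to equal the default ("unknown") is indistinguishable from a missing key; using a dedicated sentinel object avoids that ambiguity.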

Since precise querying is required, approximate vector matching cannot be used.
Think of it as applying an LLM to SQL.
Could you tell me whether my approach is acceptable? I have tried this method, but the actual results have not met my expectations.
Do you have a better approach, perhaps using an LLM with SQL as a tool for the agent?
I am looking forward to your reply very much~ :partying_face:

If there are too many fields, or fields joined from other tables, won't the tool's description or schema cause the input token count to explode? :confounded_face:

Perhaps I didn't express myself accurately. What I meant is: users should be able to access any of the 255 fields at any time. How should that situation be handled? Or does the LLM have a better way to select fields based on the user's question? (Assume a single category contains this many fields.)

It’s structured data, from top to bottom.

This approach of passing query as a parameter that can map to multiple different fields, much like SQL, seems the most viable, rather than hardcoding every field individually:

async def query_field(company_name: str, query: str) -> str:
    """
    Second interface — keyword-based field matching.
    """
    fields = match_fields(query)

    if not fields:
        return (
            f"No fields related to '{query}' were matched. "
            "Please try a more specific description."
        )

    result = await api.kyc(company_name)
    data = result["result"]["Data"]

    lines = []

    for field in fields:
        value = deep_get(data, field)

        if value == "unknown":
            lines.append(f"\n【{field}】No relevant data found")
        else:
            lines.append(f"\n【{field}】")
            lines.append(format_value(value))

    return "\n".join(lines)

That said, I would recommend improving the tool description by encoding in it the fields that can be part of the query. Including a few examples of valid queries in the tool description could also help performance (the following is just an example).

@tool
async def query_company_fields(company_name: str, query: str) -> str:
    """Query specific fields from a company's KYC profile.

    Available fields:
        Registration:
            registered_name, incorporation_date, registered_address,
            operating_status, business_scope, company_type

        Financials:
            registered_capital, paid_in_capital, annual_revenue,
            credit_rating, tax_id

        Legal:
            licenses, litigation_records, administrative_penalties,
            bankruptcy_status

        Personnel:
            legal_representative, shareholders, beneficial_owners,
            board_members

    Example queries and the fields they map to:
        "Is the company still active?"
            -> operating_status

        "Who founded the company?"
            -> legal_representative, shareholders

        "Any lawsuits or penalties?"
            -> litigation_records, administrative_penalties

        "What is the registered capital?"
            -> registered_capital, paid_in_capital

        "Tell me about the beneficial owners"
            -> beneficial_owners, shareholders

    Args:
        company_name: The company to look up.
        query: Natural language description of what information is needed.

    Returns:
        Formatted field values from the company's KYC record.
    """
    return await query_field(company_name, query)


Additional Suggestion for Improving query_field

I have another suggestion for improving query_field. Rather than the deterministic matching you are doing now, you can use a separate LLM call for the matching itself. The snippet below is just an example; since this is a single LLM call with no past message history, the context window can comfortably handle a large schema:

FIELD_SCHEMA = """
field1: Legal registered name of the company
field2: Date of incorporation
field3: Registered address (full)
field4: Operating status (active/inactive/dissolved)
...
field255: ...
"""

FIELD_SELECTOR_PROMPT = """
You are a data field selector. Given a user question, return ONLY the relevant field names from the schema below as a JSON list.

Schema:
{schema}

User question: {query}

Return only a JSON array of field names, e.g. ["field1", "field42"]. No explanation.
"""

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # any chat model instance works here

async def select_fields_with_llm(query: str) -> list[str]:
    """Use an LLM to select relevant fields from the schema based on the user query."""
    prompt = ChatPromptTemplate.from_template(FIELD_SELECTOR_PROMPT)
    chain = prompt | llm | JsonOutputParser()

    fields = await chain.ainvoke({
        "schema": FIELD_SCHEMA,
        "query": query
    })

    # Validate: only return fields that actually exist
    # (ALL_FIELD_NAMES is your list of 255 canonical field names)
    valid = set(ALL_FIELD_NAMES)
    return [f for f in fields if f in valid]

async def query_field(company_name: str, query: str) -> str:
    fields = await select_fields_with_llm(query)

    if not fields:
        return f"No relevant fields found for: '{query}'"

    result = await api.kyc(company_name)
    data = result["result"]["Data"]

    lines = []

    for field in fields:
        value = deep_get(data, field)

        if value == "unknown":
            lines.append(f"【{field}】No data available")
        else:
            lines.append(f"【{field}】")
            lines.append(format_value(value))

    return "\n".join(lines)

I hope this helps

