Structured data fields (1000+): Dedicated LLM channel vs vectorized field names?

Related to this previous topic (now closed):

Following up with a more specific question…
Thank you very much for your suggestion. Using a separate LLM to process field information seems very effective: it does not participate in the user's conversational context, but only matches and confirms the user's input parameters once, after the field representation of that input has been finalized through the main LLM's interaction with the user.

I now have a follow-up idea, and I would like to ask for your (or anyone's) help.

Current Usage

I use vectorization to store and retrieve unstructured data, such as:

  • Question and answer pairs
  • Images and videos
  • Some URLs
  • Abstracts
  • Other unstructured data

Core Question

But does structured data really need vectorized retrieval? I don't think it should.

In my situation:

  • A single table has a very large number of fields (all fields in this table belong to the same category, about customers)
  • Fields from linked tables (data from different companies, but in the same category) are also aggregated together

Solution Options

So, in this case, should we:

  1. Use multiple LLM channels combined with type prompts from tool use for processing?
  2. Or use vectorized retrieval? (The vectorization here should be the type/pattern of the field)

My Preference and Concerns

I still prefer using an LLM to match user input parameters to structured data/fields. However, when there are too many fields, several issues arise:

  • Multiple identical categories exist (company information, company situation, company impact, etc.)
  • These categories are not specified by me, but returned by other interfaces
  • If I have to partition them myself, it would avoid the token consumption caused by too many fields, but it would greatly increase my workload for little reward

For example:

I would either have to track down the developer of each interface, or redefine distinct names for identical fields across the returned interfaces. For example, if provider 1's interface and provider 2's interface both return a field f1, I would need to rename them: provider1_f1 and provider2_f1.

Extreme Scenario

Realistically, the number of fields is unlikely to reach 10,000 (unless, hypothetically, the upstream developer wrapped all of this data behind one badly designed monolithic interface).

Question: can an LLM-based approach still handle such cases?

Looking Forward to Your Thoughts

What are your thoughts?

Or, if it’s really that extreme, do we need to use vectorized field names?

Your instinct is correct; structured data should not use vector retrieval for field selection. Vector embeddings are designed for semantic similarity in unstructured text, not for precise field mapping in structured schemas. For your 255-field KYC API scenario, I would instead recommend a two-stage hierarchical LLM pipeline.


Why not vector retrieval for fields?

  • Vector similarity is approximate — you need exact field selection

  • Embeddings don’t understand field semantics (e.g., registered_capital vs paid_in_capital might embed similarly but mean different things)

  • You lose the ability to apply business logic (e.g., “always include operating_status when querying legal fields”)


The two-stage LLM approach

Stage 1: Category selection

The first LLM call identifies which high-level categories are relevant to the user’s query (again, the categories below are made up for clarity).

CATEGORY_SCHEMA = """
registration: Company name, incorporation date, registered address, business scope
financials: Capital, revenue, credit rating, tax information
legal: Licenses, litigation, penalties, compliance status
personnel: Legal representative, shareholders, beneficial owners, board members
operations: Business activities, branches, subsidiaries, partnerships
"""

async def select_categories(query: str) -> list[str]:
    """LLM selects relevant categories from user query."""
    prompt = f"""Given this query: "{query}"
    
    Return relevant categories from: {CATEGORY_SCHEMA}
    
    Output only a JSON array of category names, e.g. ["legal", "personnel"]
    """
    response = await llm.ainvoke(prompt)
    return parse_json(response)  # ["legal", "personnel"]
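The `llm` client and the `parse_json` helper above are assumed: `llm` can be any LangChain chat model, and `parse_json` needs to tolerate the code fences a model sometimes wraps around its JSON. A minimal defensive sketch (not a definitive implementation):

```python
import json
import re

def parse_json(response) -> list[str]:
    """Extract a JSON array from an LLM response, tolerating code fences.

    `response` may be a raw string or a LangChain message object
    with a `.content` attribute; both are handled.
    """
    text = getattr(response, "content", response)
    # Strip optional ```json ... ``` fences the model may add.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    try:
        result = json.loads(text)
        return result if isinstance(result, list) else []
    except json.JSONDecodeError:
        return []
```

Falling back to an empty list on malformed output lets the pipeline return a clean "no relevant fields" message instead of crashing mid-conversation.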


Stage 2: Field selection within categories

The second LLM call operates only on the fields within the selected categories.

FIELD_DEFINITIONS = {
    "legal": {
        "licenses": "Business licenses and permits held by the company",
        "litigation_records": "Ongoing or historical lawsuits",
        "administrative_penalties": "Fines or sanctions from regulators",
        "bankruptcy_status": "Bankruptcy filings or insolvency proceedings"
    },
    "personnel": {
        "legal_representative": "Primary legal representative of the company",
        "shareholders": "List of shareholders and ownership percentages",
        "beneficial_owners": "Ultimate beneficial owners (UBO)",
        "board_members": "Members of the board of directors"
    }
    # ... other categories
}

async def select_fields(query: str, categories: list[str]) -> list[str]:
    """LLM selects specific fields within the chosen categories."""
    # Build schema only for selected categories
    relevant_schema = {}
    for cat in categories:
        relevant_schema.update(FIELD_DEFINITIONS[cat])
    
    prompt = f"""Given this query: "{query}"
    
    Select relevant fields from:
    {format_schema(relevant_schema)}
    
    Output only a JSON array of field names.
    """
    response = await llm.ainvoke(prompt)
    return parse_json(response)  # ["litigation_records", "administrative_penalties", "beneficial_owners"]


Combined pipeline

async def query_field(company_name: str, query: str) -> str:
    # Stage 1: Select categories
    categories = await select_categories(query)
    
    # Stage 2: Select fields within categories
    fields = await select_fields(query, categories)
    
    if not fields:
        return f"No relevant fields found for: '{query}'"
    
    # Fetch and format data
    result = await api.kyc(company_name)
    data = result["result"]["Data"]
    
    lines = []
    for field in fields:
        value = deep_get(data, field)
        if value == "unknown":
            lines.append(f"【{field}】No data available")
        else:
            lines.append(f"【{field}】")
            lines.append(format_value(value))
    
    return "\n".join(lines)
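`deep_get` and `format_value` in the pipeline are also assumed helpers. One possible sketch, assuming dot-separated paths for nested fields and the same `"unknown"` sentinel the pipeline checks for:

```python
from typing import Any

def deep_get(data: dict, field: str) -> Any:
    """Walk a dot-separated path into nested dicts.

    Returns the sentinel "unknown" when any path segment is missing,
    matching the check in query_field.
    """
    current: Any = data
    for key in field.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return "unknown"
    return current

def format_value(value: Any) -> str:
    """Render dicts and lists as readable indented lines, scalars as strings."""
    if isinstance(value, dict):
        return "\n".join(f"  {k}: {v}" for k, v in value.items())
    if isinstance(value, list):
        return "\n".join(f"  - {format_value(v)}" for v in value)
    return str(value)
```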


Wrapping as an agent tool

The agent sees a clean, intent-focused interface:

from langchain_core.tools import tool

@tool
async def query_company_fields(company_name: str, query: str) -> str:
    """Query specific information from a company's KYC profile.

    This tool searches across company registration, financial, legal, personnel,
    and operational data. It automatically identifies relevant fields based on
    your natural language query.

    Example queries:
      - "What is the registered address and incorporation date?"
      - "Who are the beneficial owners and their nationalities?"
      - "Is the company currently active? Any legal issues?"
      - "What licenses does the company hold?"
      - "Show me the company's capital structure and credit rating"

    Args:
        company_name: The company to look up.
        query: Natural language description of the information needed.

    Returns:
        Formatted field values from the company's KYC record.
    """
    return await query_field(company_name, query)

“Multiple identical categories from different providers”

If you’re aggregating fields from multiple APIs (e.g., Provider1’s company_info, Provider2’s company_info), namespace them in your schema:

FIELD_DEFINITIONS = {
    "registration_provider1": {
        "provider1_registered_name": "...",
        "provider1_incorporation_date": "..."
    },
    "registration_provider2": {
        "provider2_registered_name": "...",
        "provider2_incorporation_date": "..."
    }
}

The LLM can reason about which provider is relevant based on the query context, or you can merge the fields into canonical ones if they are semantically identical.
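If you choose to merge, the mapping can be a plain lookup table applied after fetching; the field names below are hypothetical:

```python
# Hypothetical mapping from provider-specific names to one canonical field.
CANONICAL_MAP = {
    "provider1_registered_name": "registered_name",
    "provider2_registered_name": "registered_name",
    "provider1_incorporation_date": "incorporation_date",
    "provider2_incorporation_date": "incorporation_date",
}

def merge_providers(raw: dict) -> dict:
    """Collapse provider-specific fields into canonical ones.

    On conflicts, the first provider encountered wins; adjust this
    precedence rule to match your data-quality ranking.
    """
    merged: dict = {}
    for field, value in raw.items():
        canonical = CANONICAL_MAP.get(field, field)
        merged.setdefault(canonical, value)
    return merged
```

The upside is that the LLM then selects from a smaller canonical schema; the downside is you maintain the mapping table, which is exactly the "thankless workload" concern, so only merge fields that are truly identical in meaning.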


“Can LLM handle 10,000 fields?”

With two-stage selection: yes. Stage 1 reduces 10,000 fields to ~100–200 based on category. Stage 2 operates on that subset. Modern LLMs (GPT-4, Claude 3.5) can easily handle schemas of this size, especially with structured output modes (JSON schema constraints).
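A rough back-of-the-envelope calculation shows why the two-stage split scales; the counts per category and tokens per description line are illustrative assumptions, not measurements:

```python
# Assume 10,000 fields spread over 50 categories (~200 fields each),
# with each one-line description costing roughly 15 tokens.
TOKENS_PER_LINE = 15

flat_prompt = 10_000 * TOKENS_PER_LINE       # all fields in one prompt
stage1_prompt = 50 * TOKENS_PER_LINE         # category descriptions only
stage2_prompt = 2 * 200 * TOKENS_PER_LINE    # fields of ~2 selected categories

print(flat_prompt)                    # 150000 tokens: blows most context budgets
print(stage1_prompt + stage2_prompt)  # 6750 tokens: fits comfortably
```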


“Should I use vector retrieval for field names?”

Only if you’ve exhausted hierarchical LLM selection and are still hitting context limits (unlikely below 50,000 fields). Vector retrieval introduces approximation error that’s unnecessary when you have clean structured metadata. Reserve it for unstructured content (documents, images, Q&A pairs) where semantic similarity is the goal.

Thank you for your professional response. Best regards :partying_face:
