Your instinct is correct: structured data should not use vector retrieval for field selection. Vector embeddings are designed for semantic similarity over unstructured text, not for precise field mapping in structured schemas. For your 255-field KYC API scenario, I'd recommend a two-stage hierarchical LLM pipeline instead.
## Why not vector retrieval for fields?

- Vector similarity is approximate — you need exact field selection
- Embeddings don't understand field semantics (e.g., `registered_capital` vs `paid_in_capital` might embed similarly but mean different things)
- You lose the ability to apply business logic (e.g., "always include `operating_status` when querying legal fields")
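That last point is worth stressing: deterministic co-selection rules are trivial to layer on top of exact field selection but have no natural home in a vector pipeline. A minimal sketch — the rule table and field names here are illustrative, not from your schema:

```python
# Deterministic co-selection rules: if any trigger field is chosen,
# the companion fields are always added. (Illustrative rule table.)
CO_SELECT_RULES = {
    frozenset({"litigation_records", "administrative_penalties"}): {"operating_status"},
    frozenset({"beneficial_owners"}): {"shareholders"},
}

def apply_business_rules(selected: list[str]) -> list[str]:
    """Expand an LLM-selected field list with mandatory companion fields."""
    result = list(selected)
    for trigger_fields, extra_fields in CO_SELECT_RULES.items():
        if trigger_fields & set(selected):
            result.extend(f for f in extra_fields if f not in result)
    return result
```

Running this as a post-processing step keeps the guarantee out of the prompt entirely, so it holds even when the LLM forgets.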
## The two-stage LLM approach

### Stage 1: Category selection

The first LLM call identifies which high-level categories are relevant to the user's query (again, the schema below is made up for clarity):
```python
CATEGORY_SCHEMA = """
registration: Company name, incorporation date, registered address, business scope
financials: Capital, revenue, credit rating, tax information
legal: Licenses, litigation, penalties, compliance status
personnel: Legal representative, shareholders, beneficial owners, board members
operations: Business activities, branches, subsidiaries, partnerships
"""

async def select_categories(query: str) -> list[str]:
    """LLM selects relevant categories from the user query."""
    prompt = f"""Given this query: "{query}"
Return relevant categories from: {CATEGORY_SCHEMA}
Output only a JSON array of category names, e.g. ["legal", "personnel"]
"""
    response = await llm.ainvoke(prompt)
    return parse_json(response)  # e.g. ["legal", "personnel"]
```
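The `parse_json` helper isn't shown above; a defensive version should tolerate the model wrapping its array in a markdown fence or surrounding prose. One possible sketch (the fallback behavior is my assumption, not part of your pipeline):

```python
import json
import re

def parse_json(response: str) -> list[str]:
    """Extract a JSON array from an LLM response, tolerating markdown fences."""
    # Strip optional ```json ... ``` fencing around the payload.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", response.strip())
    # Fall back to the first [...] span if extra prose surrounds it.
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if match:
        text = match.group(0)
    return json.loads(text)
```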
### Stage 2: Field selection within categories

The second LLM call operates only on the fields within the selected categories.
```python
FIELD_DEFINITIONS = {
    "legal": {
        "licenses": "Business licenses and permits held by the company",
        "litigation_records": "Ongoing or historical lawsuits",
        "administrative_penalties": "Fines or sanctions from regulators",
        "bankruptcy_status": "Bankruptcy filings or insolvency proceedings"
    },
    "personnel": {
        "legal_representative": "Primary legal representative of the company",
        "shareholders": "List of shareholders and ownership percentages",
        "beneficial_owners": "Ultimate beneficial owners (UBO)",
        "board_members": "Members of the board of directors"
    },
    # ... other categories
}

async def select_fields(query: str, categories: list[str]) -> list[str]:
    """LLM selects specific fields within the chosen categories."""
    # Build the schema only for the selected categories
    relevant_schema = {}
    for cat in categories:
        relevant_schema.update(FIELD_DEFINITIONS[cat])
    prompt = f"""Given this query: "{query}"
Select relevant fields from:
{format_schema(relevant_schema)}
Output only a JSON array of field names.
"""
    response = await llm.ainvoke(prompt)
    return parse_json(response)  # ["litigation_records", "administrative_penalties", "beneficial_owners"]
```
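Two small helpers keep this stage honest. `format_schema` (used in the prompt above) just needs to render name/description pairs, and it's worth filtering the LLM's answer against the offered schema so a hallucinated field name never reaches your data layer. Both sketches below are hypothetical implementations, not part of the original pipeline:

```python
def format_schema(schema: dict[str, str]) -> str:
    """Render field definitions as '- name: description' prompt lines."""
    return "\n".join(f"- {name}: {desc}" for name, desc in schema.items())

def validate_fields(selected: list[str], schema: dict[str, str]) -> list[str]:
    """Drop any field the LLM returned that wasn't actually offered."""
    return [f for f in selected if f in schema]
```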
### Combined pipeline
```python
async def query_field(company_name: str, query: str) -> str:
    # Stage 1: select categories
    categories = await select_categories(query)

    # Stage 2: select fields within those categories
    fields = await select_fields(query, categories)
    if not fields:
        return f"No relevant fields found for: '{query}'"

    # Fetch and format the data
    result = await api.kyc(company_name)
    data = result["result"]["Data"]

    lines = []
    for field in fields:
        value = deep_get(data, field)
        if value == "unknown":
            lines.append(f"【{field}】No data available")
        else:
            lines.append(f"【{field}】")
            lines.append(format_value(value))
    return "\n".join(lines)
```
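`deep_get` and `format_value` are referenced but not defined in the pipeline above. Plausible sketches, assuming dot-separated field paths and dict/list values — adjust to your actual KYC payload shape:

```python
import json

def deep_get(data: dict, path: str, default: str = "unknown"):
    """Walk a nested dict by a dot-separated path, e.g. 'legal.licenses'."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

def format_value(value) -> str:
    """Pretty-print nested structures; pass scalars through as strings."""
    if isinstance(value, (dict, list)):
        return json.dumps(value, ensure_ascii=False, indent=2)
    return str(value)
```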
## Wrapping as an agent tool

The agent sees a clean, intent-focused interface:
```python
from langchain_core.tools import tool

@tool
async def query_company_fields(company_name: str, query: str) -> str:
    """Query specific information from a company's KYC profile.

    This tool searches across company registration, financial, legal, personnel,
    and operational data. It automatically identifies relevant fields based on
    your natural language query.

    Example queries:
    - "What is the registered address and incorporation date?"
    - "Who are the beneficial owners and their nationalities?"
    - "Is the company currently active? Any legal issues?"
    - "What licenses does the company hold?"
    - "Show me the company's capital structure and credit rating"

    Args:
        company_name: The company to look up.
        query: Natural language description of the information needed.

    Returns:
        Formatted field values from the company's KYC record.
    """
    return await query_field(company_name, query)
```
## "Multiple identical categories from different providers"

If you're aggregating fields from multiple APIs (e.g., Provider1's `company_info` and Provider2's `company_info`), namespace them in your schema:
```python
FIELD_DEFINITIONS = {
    "registration_provider1": {
        "provider1_registered_name": "...",
        "provider1_incorporation_date": "..."
    },
    "registration_provider2": {
        "provider2_registered_name": "...",
        "provider2_incorporation_date": "..."
    }
}
```
The LLM can reason about which provider is relevant based on the query context, or you can merge the providers into canonical fields when they're semantically identical.
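The merge-into-canonical-fields option can be as simple as a mapping table plus a provider priority order. A sketch — the mapping and the "unknown" sentinel are illustrative assumptions:

```python
# Map each canonical field to the provider-specific fields that feed it,
# in priority order: the first provider with a real value wins.
CANONICAL_MAP = {
    "registered_name": ["provider1_registered_name", "provider2_registered_name"],
    "incorporation_date": ["provider1_incorporation_date", "provider2_incorporation_date"],
}

def merge_canonical(record: dict) -> dict:
    """Collapse namespaced provider fields into canonical fields."""
    merged = {}
    for canonical, sources in CANONICAL_MAP.items():
        for source in sources:
            value = record.get(source)
            if value not in (None, "", "unknown"):
                merged[canonical] = value
                break
    return merged
```

With this in place the field-selection LLM only ever sees the canonical names, which keeps the Stage 2 schema smaller.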
## "Can an LLM handle 10,000 fields?"
With two-stage selection, yes. Stage 1 reduces 10,000 fields to roughly 100–200 by category; Stage 2 operates only on that subset. Modern LLMs (GPT-4, Claude 3.5) handle schemas of this size comfortably, especially with structured output modes (JSON schema constraints).
## "Should I use vector retrieval for field names?"
Only if you’ve exhausted hierarchical LLM selection and are still hitting context limits (unlikely below 50,000 fields). Vector retrieval introduces approximation error that’s unnecessary when you have clean structured metadata. Reserve it for unstructured content (documents, images, Q&A pairs) where semantic similarity is the goal.