Support for list-valued attributes in self query retriever

Hello,

I have run a few tests with SelfQueryRetriever, and it is a great tool. One thing I find difficult is handling attributes that hold multiple values, such as a list of strings.

For example, suppose I am building a self query retriever on a movie database. Each movie can have multiple genres, so I want the AttributeInfo for the genre attribute to state that it holds multiple strings.

I read the docs, which say that lists of string values are supported, but in practice I couldn't get it to work.

Please advise on the right way to add an attribute with multiple string values in SelfQueryRetriever.

Thanks,
V

Hi @valtahomes

It looks like this depends on the vector store provider. I put together the example below with Qdrant, and it seems to work:

from dotenv import load_dotenv
from langchain_classic.chains.query_constructor.schema import AttributeInfo
from langchain_classic.retrievers.self_query.base import SelfQueryRetriever
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
# Only needed if you uncomment the allowed_operators / allowed_comparators lines
from langchain_core.structured_query import Comparator, Operator

load_dotenv(verbose=True)

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="List of genres (strings), e.g. ['science fiction', 'adventure']",
        type="list[string]",
    ),
    AttributeInfo(name="year", description="Release year", type="integer"),
]

docs = [
    Document(
        page_content="A thrilling adventure in space.",
        metadata={"genre": ["science fiction", "adventure", "action", "thriller"], "year": 2020},
    ),
]

# print(docs)

embeddings = OpenAIEmbeddings()
# Keep list-valued metadata as-is (e.g., {"genre": ["science fiction", "adventure"]})
texts = [d.page_content for d in docs]
metadatas = [d.metadata for d in docs]

vectorstore = QdrantVectorStore.from_texts(
    texts=texts,
    embedding=embeddings,
    metadatas=metadatas,
    location=":memory:",  # or path="/tmp/qdrant" / host+port for server mode
    collection_name="movies",
)

llm = ChatOpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Brief summary of a movie",
    metadata_field_info=metadata_field_info,
    verbose=True,
    chain_kwargs={
        "examples": [
            (
                "Find movies that are of science fiction and action genre at the same time, after 2015",
                {
                    "query": "",
                    "filter": 'and(gt("year", 2015), eq("genre", "science fiction"), eq("genre", "action"))'
                },
            ),
        ],
        # You can also explicitly constrain the grammar (usually auto-set from the translator):
        # "allowed_operators": [Operator.AND, Operator.OR],
        # "allowed_comparators": [Comparator.EQ, Comparator.NE, Comparator.GT, Comparator.GTE, Comparator.LT, Comparator.LTE],
    },
)

result = retriever.invoke("Find movies that are of thriller genre and of action genre at the same time, after 2015")

if len(result) == 0:
    print("No movies found")
else:
    for doc in result:
        print(doc)

Response:

page_content='A thrilling adventure in space.' metadata={'genre': ['science fiction', 'adventure', 'action', 'thriller'], 'year': 2020, '_id': 'db908c84856d4d0e86534e6be1fd95f7', '_collection_name': 'movies'}
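
A likely reason this works: to my knowledge, Qdrant treats an equality match on an array-valued payload field as satisfied when at least one element equals the value, so combining two eq(...) checks with and(...) behaves like "contains both". The plain-Python sketch below only mimics that behaviour on the metadata from the example; `eq_matches` and `matches_all` are hypothetical helpers for illustration, not part of LangChain or Qdrant.

```python
def eq_matches(stored, wanted):
    """Mimic an equality check where a list-valued field matches on any element."""
    if isinstance(stored, list):
        return wanted in stored
    return stored == wanted


def matches_all(metadata, field, wanted_values):
    """AND together one equality check per wanted value (hypothetical helper)."""
    return all(eq_matches(metadata.get(field), value) for value in wanted_values)


movie = {"genre": ["science fiction", "adventure", "action", "thriller"], "year": 2020}

print(matches_all(movie, "genre", ["thriller", "action"]))  # True
print(matches_all(movie, "genre", ["thriller", "comedy"]))  # False
```

A store that compares the whole field value to the string instead would never match here, which may be why the same filter fails with other providers.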

Hi @pawel-twardziak

Thanks, this is great.

How about Chroma and PostgreSQL's PGVector? Do these two also support lists of strings?

Thanks,
V

Hi @valtahomes

I couldn't get this to work with ChromaDB, and I haven't tried PGVector yet. I am checking PGVector now.
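
One possible workaround for stores that only accept scalar metadata values, as an untested sketch: expand the list into one boolean flag per value before indexing. That Chroma is limited to scalar metadata is an assumption here and may depend on the version. `flatten_list_field` and the `genre_<value>` key naming are hypothetical and not part of any library.

```python
def flatten_list_field(metadata, field):
    """Replace a list-valued field with one boolean flag per element.

    Hypothetical helper: the "<field>_<value>" key naming is an arbitrary choice.
    """
    flat = {key: value for key, value in metadata.items() if key != field}
    for value in metadata.get(field, []):
        flat[f"{field}_{value.replace(' ', '_')}"] = True
    return flat


movie = {"genre": ["science fiction", "action"], "year": 2020}
print(flatten_list_field(movie, "genre"))
# {'year': 2020, 'genre_science_fiction': True, 'genre_action': True}
```

With this layout each genre would need its own boolean AttributeInfo, so it only seems practical for a small, fixed set of genres.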

We have been using Chroma and never succeeded in doing so either. Thank you for confirming this.

Would love to hear your comments on PGVector. Thanks.

Hi @valtahomes

I am done for today since I am on sick leave. I will pick this up again over the coming days and keep you in the loop.