GraphDocument with multiple Document sources

Hello there! I am trying to ingest data into Neo4j for a hybrid RAG application.

I have seen some tutorials where you basically do this:

  • Load Document objects through a loader
  • Chunk the document
  • Feed the chunks to an LLM transformer, which creates GraphDocument objects with GraphDocument.source pointing to the original Document chunk
  • Add them to Neo4j using Neo4jGraph.add_graph_documents(graph_documents, include_source=True)

My JSON structures contain basically a Title, a Summary and a Long Description (plus some additional irrelevant properties), so I’d like to build my data as:

  • GraphDocument (everything but the description)
  • GraphDocument.source (all the JSON data, formatted and with embeddings for vector queries)

Now, my data is already available in a structured JSON format and I know exactly what the GraphDocument has to contain. I don’t really need an LLM to read the JSON and infer the structure (it’s slower, more expensive, and not deterministic).

I thought about building it manually, but I ran into an issue I couldn’t find any documentation for: GraphDocument.source is a single Document object, while the description will inevitably become large at some point and will need to be chunked into multiple Document objects.

I’m not sure what purpose GraphDocument.source serves (apart from creating the MENTIONS relationships). I could keep the vector and graph data separate and create such relationships manually, but is that the right approach?

Wouldn’t it be better to support multiple sources in the GraphDocument class?

I agree that from a design perspective, it would be very convenient if the GraphDocument.source attribute could directly support a list of Document objects. This would be an intuitive way to represent a graph derived from a text that has been chunked into multiple pieces.

The current design, however, can still support this use case: you can enrich the metadata of each document chunk during the ingestion process.

For example, when you split a large document, you can add identifiers to each chunk’s metadata, such as a parent_id (linking to the original document) and a chunk_number, or a next_chunk_id and prev_chunk_id.
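
As a rough illustration of that metadata scheme, here is a minimal sketch in plain Python. The Chunk class is only a stand-in for a LangChain Document (text plus a metadata dict), and chunk_with_links, the "id" key, the id format and the fixed-size splitting are all assumptions made for the example, not part of any library:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunk_with_links(text: str, parent_id: str, size: int = 200) -> list[Chunk]:
    """Split text into fixed-size pieces and record parent/neighbour ids.

    Hypothetical helper: a real pipeline would likely use a proper text
    splitter instead of naive character slicing.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    pieces = [text[i:i + size] for i in range(0, len(text), size)]
    last = len(pieces) - 1
    chunks = []
    for n, piece in enumerate(pieces):
        chunks.append(Chunk(
            page_content=piece,
            metadata={
                "id": f"{parent_id}-{n}",          # assumed id format
                "parent_id": parent_id,            # links back to the original document
                "chunk_number": n,                 # position within the parent
                "prev_chunk_id": f"{parent_id}-{n - 1}" if n > 0 else None,
                "next_chunk_id": f"{parent_id}-{n + 1}" if n < last else None,
            },
        ))
    return chunks


chunks = chunk_with_links("a" * 450, parent_id="doc1", size=200)
```

Each chunk can then be stored with its own embedding, while the GraphDocument keeps pointing at a single source; the ids in the metadata are what tie the pieces back together.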

Then, at retrieval time, when you fetch a graph node and its corresponding source chunk, your application can inspect this metadata. If the context seems incomplete, you can use the parent_id and chunk_number to retrieve the next (or previous) chunks from your document store and reconstruct the full context for the LLM.
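
The neighbour lookup described above could look roughly like this. It is only a sketch: the in-memory dict keyed by (parent_id, chunk number) stands in for whatever document store you actually use, and expand_context and its window parameter are hypothetical names:

```python
def expand_context(store: dict, parent_id: str, chunk_number: int, window: int = 1) -> str:
    """Return the hit chunk's text plus up to `window` neighbours on each side.

    Chunks missing from the store (e.g. before the first or after the
    last one) are simply skipped.
    """
    parts = []
    for n in range(chunk_number - window, chunk_number + window + 1):
        text = store.get((parent_id, n))
        if text is not None:
            parts.append(text)
    return "".join(parts)


# Hypothetical in-memory store keyed by (parent_id, chunk number)
store = {
    ("doc1", 0): "Alpha ",
    ("doc1", 1): "beta ",
    ("doc1", 2): "gamma",
}

# Suppose the retrieved source chunk was chunk 1 of "doc1"
context = expand_context(store, "doc1", 1)
```

With Neo4j as the store, the same idea would presumably become a query that matches chunk nodes by parent_id and a range of chunk numbers, but the exact schema depends on how you ingest the chunks.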