Hi! It’s not clear to me how the order of the items in the separator list, which is a parameter of RecursiveCharacterTextSplitter, works. The docs page for RecursiveCharacterTextSplitter (LangChain 0.0.149) is very brief. It’s only clear that the chunks aren’t split at every item, and the order seems to indicate some sort of priority. But how exactly does it decide to split the input text at one of the default items in the separator list, ["\n\n", "\n", " ", ""]?
Thank you
It tries to split in the order of the separator list.
First it will try to split by the first item "\n\n"
Then, if any of the resulting chunks is larger than max_chunk_size, the next separator in the list is applied to that chunk to try to reduce its size.
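To make the priority idea concrete, here is a minimal sketch of just that fallback behaviour. It is not LangChain's code: the function name split_by_priority is made up for illustration, max_chunk_size is the name used in this thread, and it leaves out the merging of small pieces and any chunk overlap.

```python
def split_by_priority(text, separators, max_chunk_size):
    """Split on the first separator; use the remaining ones only on oversized pieces.

    Illustrative only: separators are dropped, and small pieces are not merged.
    """
    # A piece that already fits (or has no separators left) is kept as is.
    if len(text) <= max_chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if piece:  # skip empty strings produced by adjacent separators
            chunks.extend(split_by_priority(piece, rest, max_chunk_size))
    return chunks


print(split_by_priority("aaaa bbbb\n\ncc", ["\n\n", " ", ""], 5))
# ['aaaa', 'bbbb', 'cc']
```

Here "cc" is never touched by the " " separator because it already fits after the "\n\n" split; only "aaaa bbbb" is passed on to the next separator.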
Thank you! But how long does it wait before trying the next separator? If a chunk surpasses max_chunk_size, it could try each separator in turn, immediately reach "", and split the string right there. It seems to me that the string is sometimes split earlier than max_chunk_size and sometimes much later, but I don’t understand how it decides when to use each separator once the index approaches max_chunk_size. In other words, how do I know the real min and max chunk size before another separator is picked? For example, with the separator list [". ", " "], how can I know the min and max string size at which the separator changes from dot to space? Thanks again
There’s no concept of “waiting”. The full implementation is here; you can take a look for full detail.
You can think of it like this:
First, we split on the first separator.
Then we iterate through the chunks of the document.
When we hit a chunk that is too big (larger than the max chunk size), then:
- we merge all of the small acceptable chunks so far, getting close to the desired chunk size. We flush those into a “final chunks” list that has been finalized.
- we then use the next separator on the too-big chunk to break it into smaller pieces, and continue
You can check out the source code for the exact logic, but this is the general overview!
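The steps above can be sketched roughly as follows. This is a simplified illustration and not the library's implementation: recursive_split and merge_pieces are hypothetical names, max_chunk_size is the name used in this thread, and chunk overlap is ignored.

```python
def merge_pieces(pieces, sep, max_chunk_size):
    """Greedily join consecutive small pieces with `sep` without exceeding the limit."""
    merged, current, length = [], [], 0
    for piece in pieces:
        # Cost of adding this piece, including the separator that would join it.
        extra = len(piece) + (len(sep) if current else 0)
        if current and length + extra > max_chunk_size:
            merged.append(sep.join(current))  # flush what we have so far
            current, length = [], 0
            extra = len(piece)
        current.append(piece)
        length += extra
    if current:
        merged.append(sep.join(current))
    return merged


def recursive_split(text, separators, max_chunk_size):
    """Split on the first separator, merge small pieces, recurse on oversized ones."""
    if len(text) <= max_chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in (text.split(sep) if sep else list(text)) if p]
    final_chunks, small = [], []
    for piece in pieces:
        if len(piece) <= max_chunk_size:
            small.append(piece)
        else:
            # Flush the acceptable pieces seen so far, then break up the big one.
            final_chunks.extend(merge_pieces(small, sep, max_chunk_size))
            small = []
            final_chunks.extend(recursive_split(piece, rest, max_chunk_size))
    final_chunks.extend(merge_pieces(small, sep, max_chunk_size))
    return final_chunks


text = "One. Two three four five six. Seven."
print(recursive_split(text, [". ", " "], 12))
# ['One', 'Two three', 'four five', 'six', 'Seven.']
```

In this sketch the only hard guarantee is the upper bound (as long as some separator can break a piece down far enough). There is no minimum: "One" and "six" end up as their own chunks because the next piece in line would not fit, or because an oversized piece forced a flush. The switch from ". " to " " happens only inside the one sentence that was longer than 12 characters.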