Search code examples
pythonlangchainpy-langchain

Langchain: text splitter behavior


I don't understand the following behavior of Langchain recursive text splitter. Here is my code and output.

from langchain.text_splitter import RecursiveCharacterTextSplitter
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
#     separators=["\n"]#, "\n", " ", ""]
)
test = """a\nbcefg\nhij\nk"""
print(len(test))
tmp = r_splitter.split_text(test)
print(tmp)

Output

13
['a\nbcefg', 'hij\nk']

As you can see, it outputs chunks of size 7 and 5 and only splits on one of the new line characters. I was expecting output to be ['a','bcefg','hij','k']


Solution

  • Accord to the split_text funcion in RecursiveCharacterTextSplitter

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        final_chunks = []
        # Get appropriate separator to use
        separator = self._separators[-1]
        for _s in self._separators:
            if _s == "":
                separator = _s
                break
            if _s in text:
                separator = _s
                break
        # Now that we have the separator, split the text
        if separator:
            splits = text.split(separator)
        else:
            splits = list(text)
        # Now go merging things, recursively splitting longer texts.
        _good_splits = []
        for s in splits:
            if self._length_function(s) < self._chunk_size:
                _good_splits.append(s)
            else:
                if _good_splits:
                    merged_text = self._merge_splits(_good_splits, separator)
                    final_chunks.extend(merged_text)
                    _good_splits = []
                other_info = self.split_text(s)
                final_chunks.extend(other_info)
        if _good_splits:
            merged_text = self._merge_splits(_good_splits, separator)  # Here will merge the items if the cusum is less than chunk size in your example is 10
            final_chunks.extend(merged_text)
        return final_chunks
    

    this will merge the items if the cusum is less than chunk size in your example is 10