Tags: python, machine-learning, text, nlp, langchain

What does langchain CharacterTextSplitter's chunk_size param even do?


My default assumption was that the chunk_size parameter would set a ceiling on the size of the chunks/splits that come out of the split_text method, but that's clearly not right:

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 6
chunk_overlap = 2

c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

text = 'abcdefghijklmnopqrstuvwxyz'

print(c_splitter.split_text(text))

This prints ['abcdefghijklmnopqrstuvwxyz'], i.e. a single chunk far larger than chunk_size=6.

So I understand that it didn't split the text into chunks because it never encountered the separator. But then what is chunk_size even doing?

I checked the documentation page for langchain.text_splitter.CharacterTextSplitter but did not find an answer to this question. I also asked the "mendable" chat-with-langchain-docs search functionality, which answered "The chunk_size parameter of the CharacterTextSplitter determines the maximum number of characters in each chunk of text" — which, as the code sample above shows, is not true.


Solution

  • CharacterTextSplitter will only split on separator (which is '\n\n' by default). chunk_size is a target maximum that is respected only when splitting on the separator makes it possible. If a string starts with n characters, then a separator, then m more characters before the next separator, the first chunk will be just those n characters whenever chunk_size < n + len(separator) + m — i.e. whenever the two pieces cannot be merged without exceeding chunk_size.

    Your example string has no matching separators so there's nothing to split on.

    Basically, it attempts to produce chunks that are <= chunk_size characters, but it will still emit chunks larger than chunk_size when the smallest pieces the separator yields are themselves larger than chunk_size.
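The split-then-merge behavior described above can be sketched in plain Python. This is a simplified model of the strategy, not LangChain's actual implementation: it ignores chunk_overlap, and naive_character_split is a made-up name, not a LangChain API.

```python
def naive_character_split(text, chunk_size, separator="\n\n"):
    """Split on separator, then greedily merge pieces into chunks
    that stay <= chunk_size where possible (simplified sketch)."""
    pieces = text.split(separator)
    chunks, current = [], ""
    for piece in pieces:
        if not current:
            current = piece
        elif len(current) + len(separator) + len(piece) <= chunk_size:
            # Merging stays within chunk_size, so keep growing this chunk.
            current = current + separator + piece
        else:
            # Merging would exceed chunk_size: emit the current chunk,
            # even if that chunk is itself longer than chunk_size.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# No separator in the text -> one oversized chunk, mirroring the question:
print(naive_character_split("abcdefghijklmnopqrstuvwxyz", chunk_size=6))
# -> ['abcdefghijklmnopqrstuvwxyz']

# With separators present, chunks are kept <= chunk_size where possible:
print(naive_character_split("ab\n\ncd\n\nefghij", chunk_size=6))
# -> ['ab\n\ncd', 'efghij']
```

Note that the 'efghij' piece is emitted whole even though merging was refused around it — only the separator positions, not chunk_size, decide where cuts can happen.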