I don't understand the following behavior of Langchain recursive text splitter. Here is my code and output.
from langchain.text_splitter import RecursiveCharacterTextSplitter
r_splitter = RecursiveCharacterTextSplitter(
chunk_size=10,
chunk_overlap=0,
# separators=["\n"]#, "\n", " ", ""]
)
test = """a\nbcefg\nhij\nk"""
print(len(test))
tmp = r_splitter.split_text(test)
print(tmp)
Output
13
['a\nbcefg', 'hij\nk']
As you can see, it outputs chunks of size 7 and 5 and only splits on one of the new line characters. I was expecting output to be ['a','bcefg','hij','k']
Accord to the split_text funcion in RecursiveCharacterTextSplitter
def split_text(self, text: str) -> List[str]:
"""Split incoming text and return chunks."""
final_chunks = []
# Get appropriate separator to use
separator = self._separators[-1]
for _s in self._separators:
if _s == "":
separator = _s
break
if _s in text:
separator = _s
break
# Now that we have the separator, split the text
if separator:
splits = text.split(separator)
else:
splits = list(text)
# Now go merging things, recursively splitting longer texts.
_good_splits = []
for s in splits:
if self._length_function(s) < self._chunk_size:
_good_splits.append(s)
else:
if _good_splits:
merged_text = self._merge_splits(_good_splits, separator)
final_chunks.extend(merged_text)
_good_splits = []
other_info = self.split_text(s)
final_chunks.extend(other_info)
if _good_splits:
merged_text = self._merge_splits(_good_splits, separator) # Here will merge the items if the cusum is less than chunk size in your example is 10
final_chunks.extend(merged_text)
return final_chunks
this will merge the items if the cusum is less than chunk size in your example is 10