I have a list of texts and I need to split each text into multiple chunks, each shorter than a limit of 5000 bytes.
The idea is to split each text into sentences and then add them back one by one until the 5000-byte limit is reached.
This is how far I got (see the code below). I'm definitely doing something wrong, but I've spent too much time debugging it, so a pair of fresh eyes would be really appreciated. Thanks!
To test it you can use any text larger than 10k bytes.
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def split_text(text, limit):
    sentences = sent_tokenize(text)

    def get_chunk(sentences, limit):
        results = []
        counter = 0
        while counter < limit:
            for s in sentences:
                counter += len(s.encode('utf-8'))
                results.append(s)
                sentences.remove(s)
        return results, sentences

    out = []
    while len(' '.join(sentences).encode('utf-8')) > limit:
        results, sentences = get_chunk(sentences, limit)
        out.append(results)
    else:
        out.append(sentences)
    text_out = [' '.join(sentences) for sentences in out]
    return text_out
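One immediate problem in get_chunk is that it calls sentences.remove(s) while iterating over the same list with the for loop, which silently skips every other element (the counter is also only re-checked at the while, after the for loop has already consumed the list). A minimal, hypothetical demonstration of the mutation pitfall:

```python
# Toy example (not from the question): removing items from a list
# while iterating over it skips elements, because the iterator's
# index keeps advancing over the shrinking list.
items = ["a", "b", "c", "d"]
for x in items:
    items.remove(x)
print(items)  # -> ['b', 'd'] — only every other element was removed
```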
This works:
size = 0
l = []   # current chunk: a list of sentences
ll = []  # list of chunks
for s in sent_tokenize(text):
    if size + len(s.encode()) <= 5000:
        l.append(s)
        size += len(s.encode()) + 1  # +1 for the joining space
    else:
        ll.append(l.copy())
        l.clear()
        l.append(s)  # start the next chunk with the current sentence
        size = len(s.encode()) + 1
# save the remainder (if any):
if l:
    ll.append(l.copy())
We can check that the chunks are all of length <= 5000 bytes:
for l in ll:
    print(len(' '.join(l).encode()))
# 4983
# 4987
# 4781
# 4943
# .. etc ..
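The same accumulation logic can be factored into a reusable function. This is a sketch under my own naming (chunk_sentences and its parameters are not from the answer above); it takes the sentence list directly, so it doesn't depend on nltk itself:

```python
def chunk_sentences(sentences, limit=5000):
    """Group sentences into chunks whose space-joined UTF-8 size is <= limit bytes."""
    chunks, current, size = [], [], 0
    for s in sentences:
        s_len = len(s.encode())
        if size + s_len <= limit:
            current.append(s)
            size += s_len + 1  # +1 for the joining space
        else:
            if current:
                chunks.append(' '.join(current))
            current, size = [s], s_len + 1  # start a new chunk with this sentence
    if current:
        chunks.append(' '.join(current))  # remainder
    return chunks
```

Called with sent_tokenize(text) as input, it returns the joined chunk strings directly instead of lists of sentences.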