I have a list of texts and I need to split each text into multiple chunks, each shorter than a limit of 5000 bytes.
The idea is to split each text into sentences and then add them back one by one until the 5000-byte limit is reached.
This is how far I got (see the code below). I'm definitely doing something wrong, but I've spent too much time debugging it, so a pair of fresh eyes would be really appreciated. Thanks!
To test it you can use any text larger than 10k bytes.
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def split_text(text, limit):
    sentences = sent_tokenize(text)

    def get_chunk(sentences, limit):
        results = []
        counter = 0
        while counter < limit:
            for s in sentences:
                counter += len(s.encode('utf-8'))
                results.append(s)
                sentences.remove(s)
        return results, sentences

    out = []
    while len(' '.join(sentences).encode('utf-8')) > limit:
        results, sentences = get_chunk(sentences, limit)
        out.append(results)
    else:
        out.append(sentences)
    text_out = [' '.join(sentences) for sentences in out]
    return text_out
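One immediate problem in get_chunk is that it calls sentences.remove(s) while iterating over the same list with the for loop, which silently skips every other element (the counter is also only re-checked at the while, after the for loop has already consumed the list). A minimal, hypothetical demonstration of the mutation pitfall:

```python
# Toy example (not from the question): removing items from a list
# while iterating over it skips elements, because the iterator's
# index keeps advancing over the shrinking list.
items = ["a", "b", "c", "d"]
for x in items:
    items.remove(x)
print(items)  # -> ['b', 'd'] — only every other element was removed
```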
This works:
size = 0
l = []   # current chunk: a list of sentences
ll = []  # list of chunks
for s in sent_tokenize(text):
    if size + len(s.encode()) <= 5000:
        l.append(s)
        size += len(s.encode()) + 1  # +1 for the joining space
    else:
        ll.append(l.copy())
        l.clear()
        l.append(s)  # start the next chunk with the current sentence
        size = len(s.encode()) + 1
# save the remainder (if any):
if l:
    ll.append(l.copy())
We can check that the chunks are all of length <= 5000 bytes:
for l in ll:
    print(len(' '.join(l).encode()))
# 4983
# 4987
# 4781
# 4943
# .. etc ..
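The same accumulation logic can be factored into a reusable function. This is a sketch under my own naming (chunk_sentences and its parameters are not from the answer above); it takes the sentence list directly, so it doesn't depend on nltk itself:

```python
def chunk_sentences(sentences, limit=5000):
    """Group sentences into chunks whose space-joined UTF-8 size is <= limit bytes."""
    chunks, current, size = [], [], 0
    for s in sentences:
        s_len = len(s.encode())
        if size + s_len <= limit:
            current.append(s)
            size += s_len + 1  # +1 for the joining space
        else:
            if current:
                chunks.append(' '.join(current))
            current, size = [s], s_len + 1  # start a new chunk with this sentence
    if current:
        chunks.append(' '.join(current))  # remainder
    return chunks
```

Called with sent_tokenize(text) as input, it returns the joined chunk strings directly instead of lists of sentences.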