Suppose I have a long text that I want to process with an API that accepts at most N characters. I would like to split that text into two or more sub-texts, each shorter than N characters, cutting only at a separator. I know I could simply split on the separator, but I would like to keep the number of output sub-texts as small as possible.
For example, suppose my text is:
"Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique. Consulatu cotidieque ex sea, nam no duis prompta expetendis.
Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros. Expetenda torquatos assueverit est ex, te reque voluptatibus signiferumque has."
which is 550 characters long. Let's suppose that N is 250. I would expect the text to be split in this way:
Part 1: "Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique" (237 characters)
Part 2: "Consulatu cotidieque ex sea, nam no duis prompta expetendis.
Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros." (232 characters)
Any idea on how to do this in Python?
Thank you for any help. Francesca
You can do that using regex:
import re
# data is assumed to hold the text to split
output = re.findall(r".{1,250}(?:\.|$)", data)
print(output)
.{1,250} : Matches any character (except a newline, by default) between 1 and 250 times, as many times as possible.
\. : Matches a dot.
| : Or.
$ : Matches the end of the string.

You can also put the delimiter and the maximum length in variables.
import re
num_max = 250
delimiter = re.escape('.')  # escape the delimiter so it is matched literally
# data is assumed to hold the text to split
output = re.findall(fr".{{1,{num_max}}}(?:{delimiter}|$)", data)
print(output)
Output:
[
'Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique.',
' Consulatu cotidieque ex sea, nam no duis prompta expetendis.',
'Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros. Expetenda torquatos assueverit est ex, te reque voluptatibus signiferumque has.'
]
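A few caveats with the regex above: `.` does not match a newline unless `re.DOTALL` is passed, which is why the output is cut at the line break; a match can be up to 251 characters long (250 plus the trailing dot); and text that cannot be matched, such as a sentence longer than the limit, is silently skipped by `re.findall`. If those matter, a plain greedy loop may be easier to control. The following is only a minimal sketch, not the method from the answer above: `split_text`, `max_len` and `sep` are hypothetical names, and it assumes chunks may be stripped of surrounding whitespace and that an over-long sentence should raise an error.

```python
def split_text(text, max_len, sep="."):
    """Greedily pack separator-terminated pieces into chunks of at most max_len characters."""
    if not sep:
        raise ValueError("sep must be a non-empty string")

    # Cut the text right after every separator, so each piece keeps its separator.
    pieces = []
    start = 0
    while True:
        idx = text.find(sep, start)
        if idx == -1:
            break
        end = idx + len(sep)
        pieces.append(text[start:end])
        start = end
    if start < len(text):
        pieces.append(text[start:])  # trailing text with no separator

    chunks = []
    current = ""
    for piece in pieces:
        candidate = current + piece
        # Lengths are measured after stripping, since chunks are stored stripped.
        if len(candidate.strip()) <= max_len:
            current = candidate
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(piece.strip()) > max_len:
            raise ValueError(f"a single piece is longer than {max_len} characters")
        current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


print(split_text("One. Two. Three. Four.", max_len=12))
# ['One. Two.', 'Three. Four.']
```

Because each chunk is filled as far as it can go before a new one is started, the number of chunks should be as small as possible when cuts are only allowed at the separator. Newlines are treated like any other character, so a chunk can span a line break. Use `max_len=N - 1` if the chunks must be strictly shorter than N.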