I have a bunch of text samples. Each sample has a different length, but all of them consist of >200 characters. I need to split each sample into approx 50 chara ters length substrings. To do so, I found this approach:
import re
def chunkstring(string, length):
return re.findall('.{%d}' % length, string)
However, it splits a text by splitting words. For example, the phrase "I have <...> icecream. <...>" can be split into "I have <...> icec" and "ream. <...>".
This is the sample text:
This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN.
I get this result:
['This paper proposes a method that allows non-paral',
'lel many-to-many voice conversion by using a varia',
'nt of a generative adversarial network called Star']
But ideally I would like to get something similar to this result:
['This paper proposes a method that allows non-parallel',
'many-to-many voice conversion by using a variant',
'of a generative adversarial network called StarGAN.']
How could I adjust the above-given code to get the desired result?
For me this sound like task for textwrap
built-in module, example using your data
import textwrap
text = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."
print(textwrap.fill(text,55))
output
This paper proposes a method that allows non-parallel
many-to-many voice conversion by using a variant of a
generative adversarial network called StarGAN.
You will probably need some trials to get value which suits your needs best. If you need list
of str
s use textwrap.wrap
i.e. textwrap.wrap(text,55)