Split text into chunks while keeping words whole


I have a bunch of text samples. Each sample has a different length, but all of them are longer than 200 characters. I need to split each sample into substrings of approximately 50 characters. To do so, I found this approach:

import re

def chunkstring(string, length):
    return re.findall('.{%d}' % length, string)

However, it cuts the text in the middle of words. For example, the phrase "I have <...> icecream. <...>" can be split into "I have <...> icec" and "ream. <...>".

This is the sample text:

This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN.

I get this result:

['This paper proposes a method that allows non-paral',
 'lel many-to-many voice conversion by using a varia',
 'nt of a generative adversarial network called Star']

But ideally I would like to get something similar to this result:

['This paper proposes a method that allows non-parallel',
 'many-to-many voice conversion by using a variant',
 'of a generative adversarial network called StarGAN.']

How could I adjust the above-given code to get the desired result?


Solution

  • To me this sounds like a task for the built-in textwrap module. An example using your data:

    import textwrap
    text = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."
    print(textwrap.fill(text,55))
    

    Output:

    This paper proposes a method that allows non-parallel
    many-to-many voice conversion by using a variant of a
    generative adversarial network called StarGAN.
    

    You will probably need some trial and error to find the width that suits your needs best. If you need a list of strings instead of a single string, use textwrap.wrap, i.e. textwrap.wrap(text, 55).
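
As a minimal sketch of the list variant, the helper below wraps textwrap.wrap in a function. The name chunk_text and the default width of 55 are illustrative assumptions, not part of the original answer. It also sets break_long_words=False and break_on_hyphens=False (both real textwrap options), so that no word, including hyphenated ones such as "non-parallel", is ever cut.

```python
import textwrap


def chunk_text(text, width=55):
    """Split text into chunks of at most `width` characters without cutting words.

    chunk_text is a hypothetical helper name used for illustration only.
    """
    return textwrap.wrap(
        text,
        width=width,
        break_long_words=False,  # never cut a word that is longer than `width`
        break_on_hyphens=False,  # keep hyphenated words such as "non-parallel" whole
    )


text = (
    "This paper proposes a method that allows non-parallel many-to-many "
    "voice conversion by using a variant of a generative adversarial "
    "network called StarGAN."
)

chunks = chunk_text(text, 55)
for chunk in chunks:
    # Print each chunk together with its length to help tune `width`.
    print(len(chunk), chunk)
```

With break_long_words=False, a single word longer than the requested width ends up in a chunk of its own that exceeds the width, so the limit is only a hard one when every word fits. If a strict upper bound matters more than whole words, leave that option at its default.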