
python textwrap breaking sentences in wrong places


I'm finding that Python's textwrap library breaks sentences in the wrong places. I'm using:

import textwrap
wrp = textwrap.TextWrapper(width=32, break_long_words=False, replace_whitespace=False)
out = '\n'.join(wrp.wrap(txt))

Applying this to the following passage*:

The Caterpillar and Alice looked at each other for some time in silence:
at last the Caterpillar took the hookah out of its mouth, and addressed
her in a languid, sleepy voice.

'Who are YOU?' said the Caterpillar.

This was not an encouraging opening for a conversation. Alice replied,
rather shyly, 'I--I hardly know, sir, just at present--at least I know
who I WAS when I got up this morning, but I think I must have been
changed several times since then.'

The result of the wrap is:

The Caterpillar and Alice looked
at each other for some time in
silence:
at last the
Caterpillar took the hookah out
of its mouth, and addressed
her
in a languid, sleepy voice.
'Who are YOU?' said the
Caterpillar.

This was not an
encouraging opening for a
conversation. Alice replied,
rather shyly, 'I--I hardly know,
sir, just at present--at least I
know
who I WAS when I got up
this morning, but I think I must
have been
changed several times
since then.

A few of the extra breaks occur because the original text is already hard-wrapped. But incorrect breaks have still been added elsewhere, e.g. at "at last the | Caterpillar", and the last sentence is a complete mess. Can anyone advise how to wrap this properly?

* Passage sourced with: curl https://www.gutenberg.org/cache/epub/11/pg11.txt | sed -n 960,969p > alice.txt

Solution

  • Preserving text format: replace with a space any newline that is preceded by a comma or word character and followed by a word character. This joins the hard-wrapped lines while keeping the paragraph breaks:

    re.sub(r"([,\w])\n(\w)", r"\1 \2", sys.stdin.read())
    

    The Caterpillar and Alice looked at each other for some time in silence:
    at last the Caterpillar took the hookah out of its mouth, and addressed her in a languid, sleepy voice.

    'Who are YOU?' said the Caterpillar.

    This was not an encouraging opening for a conversation. Alice replied, rather shyly, 'I--I hardly know, sir, just at present--at least I know who I WAS when I got up this morning, but I think I must have been changed several times since then.'

    You can then wrap each part:

    import re
    import sys
    import textwrap
    text = re.sub(r"([,\w])\n(\w)", r"\1 \2", sys.stdin.read())
    for part in text.splitlines():
        print('\n'.join(textwrap.wrap(part, width=32)))
    

    The Caterpillar and Alice looked
    at each other for some time in
    silence:
    at last the Caterpillar took the
    hookah out of its mouth, and
    addressed her in a languid,
    sleepy voice.

    'Who are YOU?' said the
    Caterpillar.

    This was not an encouraging
    opening for a conversation.
    Alice replied, rather shyly, 'I
    --I hardly know, sir, just at
    present--at least I know who I
    WAS when I got up this morning,
    but I think I must have been
    changed several times since
    then.'
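
A possible variation on the approach above, offered as a sketch and not as the answer's own method: split the text on blank lines and wrap each paragraph separately, leaving replace_whitespace at its default of True so that the newlines from the original hard wrapping are treated as spaces. The textwrap documentation notes that with replace_whitespace=False newlines can end up in the middle of a line and cause strange output, which matches what the question shows. The names wrap_paragraphs and sample are illustrative, and the code assumes paragraphs are separated by blank lines. Unlike the regex above, it also joins the line that follows "silence:".

```python
import re
import textwrap


def wrap_paragraphs(text, width=32):
    """Re-wrap text paragraph by paragraph, keeping one blank line between them."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = []
    for paragraph in paragraphs:
        # With the default replace_whitespace=True, newlines left over from
        # the original hard wrapping are converted to spaces before wrapping.
        wrapped.append(
            textwrap.fill(paragraph, width=width, break_long_words=False)
        )
    return "\n\n".join(wrapped)


sample = (
    "The Caterpillar and Alice looked at each other for some time in silence:\n"
    "at last the Caterpillar took the hookah out of its mouth, and addressed\n"
    "her in a languid, sleepy voice.\n"
    "\n"
    "'Who are YOU?' said the Caterpillar.\n"
)
print(wrap_paragraphs(sample))
```

To run it on the file from the question, replace sample with the contents of alice.txt, for example wrap_paragraphs(open("alice.txt").read()).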