How can I wrap text with textwrap.fill but prevent specific words from appearing at the start of a line?


I have Python reading long lines, wrapping them if they exceed x characters, and writing them to a new file. I figured out how to ensure words are not split apart, but I have a more specific problem: I don't want certain words to ever appear at the start of a line. After hours of research, I have realized I have been going down the wrong path to fix this and need help.

Here's the code I have now:

with txtfile as infile, testfile as outfile:
    for line in infile:
        if len(line) > 80 and any(word in line[77:] for word in connectives):
            outfile.write(textwrap.fill(line,96,replace_whitespace=False))
        elif len(line) > 80 and not any(word in line[77:] for word in connectives):
            outfile.write(textwrap.fill(line,80,replace_whitespace=False))
        else:
            outfile.write(line)

A little explanation of what I tried to do: right now it reads a line of a couple hundred characters and, if the line is longer than 80 characters, wraps it to 80. I had thought that I would check whether the last few characters of the line contained any of the words I am targeting and, if so, lengthen the wrap width for those lines so that the target word wouldn't get dropped to the next line. But I have realized that was faulty thinking (maybe moronic is better) on my part, because the if statement only checks the original line of several hundred characters. It doesn't check the subsequent lines that wrapping produces. In the end, I can avoid breaking at the wrong word on the first wrapped line, but not on the lines after it.

Since textwrap won't break up whole words if you don't want it to, I'm hoping there is also a way to tell it not to allow certain words or characters to be dropped to the next line.

Alternatively, perhaps there is a way to read what was wrapped and, any time a specific word appears as the first word on a line, move it up to the end of the previous line.
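
The post-processing idea in the previous paragraph could be sketched roughly as follows. This is only an illustration under assumptions: the function name `wrap_then_pull_up` and its `taboo` parameter are made up for this example, and the word comparison ignores surrounding punctuation. Lines that receive a pulled-up word can end up longer than `width`, and the remaining lines are not re-flowed.

```python
import string
import textwrap


def wrap_then_pull_up(text, width=80, taboo=()):
    """Wrap with textwrap, then move taboo words off the start of lines.

    Hypothetical helper: a taboo word found at the start of a wrapped line
    is appended to the previous line instead, which may exceed `width`.
    """
    taboo = set(taboo)
    fixed = []
    for line in textwrap.wrap(text, width):
        # Keep pulling leading taboo words up while there is a previous line.
        while fixed and line:
            first, _, rest = line.partition(' ')
            if first.strip(string.punctuation) not in taboo:
                break
            fixed[-1] += ' ' + first
            line = rest.lstrip()
        if line:  # The whole line may have been pulled up.
            fixed.append(line)
    return '\n'.join(fixed)
```

Calling `wrap_then_pull_up(long_line, 80, taboo=connectives)` in place of `textwrap.fill(long_line, 80)` would apply the check to every wrapped line rather than only the first.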


Solution

  • You might be able to hack textwrap to do what you want. In the meantime, here's a snippet that does it. The basic word-wrapping code is an adaptation of the algorithm given in the Wikipedia article titled "Line wrap and word wrap".

    When a word is encountered that can't be at the beginning of the next line, it is simply added to the current one (which can make that line longer than the requested width). If you find that unacceptable, this will at least give you a code base for trying other approaches.

    import re
    
    def textsplitter(text):
        # Yield (word, suffix) pairs; suffix is any trailing non-space
        # characters such as punctuation, or '' if there are none.
        for match_obj in re.finditer(r'(\w+)(\S*)', text):
            yield match_obj.groups()
    
    def textwrapper(text, width=79, **kwargs):
        taboo = set(kwargs.get('taboo', []))  # Words that can't be first.
        result = []
        spaceleft = width
    
        for word, suffix in textsplitter(text):
            phrase = word + suffix  # Note suffix might be empty string ''.
    
            if word in taboo:   # Can't be first, so just add it.
                result.append(phrase)
                spaceleft = 0
            else:               # Add word, possibly with an inserted linebreak.
                if len(phrase) > spaceleft:
                    result.append('\n'+phrase)  # Insert linebreak before word.
                    spaceleft = width - len(phrase)
                else:
                    result.append(phrase)
                    spaceleft = spaceleft - (len(phrase) + 1)
    
        return ' '.join(result)
    
    
    sample_text = """\
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. In molestie lectus
    nulla, at aliquam dolor suscipit ac. Mauris vitae purus non est vehicula dictum.
    Integer varius diam tellus, quis cursus lacus sollicitudin sed. Nulla eu quam
    nec felis egestas tristique eu placerat est. Praesent tincidunt libero in
    aliquet euismod. Pellentesque eu odio mollis, consequat eros in, vestibulum
    mauris. Aenean gravida dolor et ligula cursus laoreet.
    """
    
    print('Wrapped with no taboo words:\n')
    print(textwrapper(sample_text, 40))
    
    print('\n'*2)
    taboo = ['adipiscing', 'aliquam']  # Not allowed to appear at start of lines.
    print('Wrapped again with taboo words {}:\n'.format(taboo))
    print(textwrapper(sample_text, 40, taboo=taboo))
    

    Output:

    Wrapped with no taboo words:
    
    Lorem ipsum dolor sit amet, consectetur
    adipiscing elit. In molestie lectus
    nulla, at aliquam dolor suscipit ac.
    Mauris vitae purus non est vehicula
    dictum. Integer varius diam tellus, quis
    cursus lacus sollicitudin sed. Nulla eu
    quam nec felis egestas tristique eu
    placerat est. Praesent tincidunt libero
    in aliquet euismod. Pellentesque eu odio
    mollis, consequat eros in, vestibulum
    mauris. Aenean gravida dolor et ligula
    cursus laoreet.
    
    
    Wrapped again with taboo words ['adipiscing', 'aliquam']:
    
    Lorem ipsum dolor sit amet, consectetur adipiscing
    elit. In molestie lectus nulla, at aliquam
    dolor suscipit ac. Mauris vitae purus non
    est vehicula dictum. Integer varius diam
    tellus, quis cursus lacus sollicitudin
    sed. Nulla eu quam nec felis egestas
    tristique eu placerat est. Praesent
    tincidunt libero in aliquet euismod.
    Pellentesque eu odio mollis, consequat
    eros in, vestibulum mauris. Aenean
    gravida dolor et ligula cursus laoreet.
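
If lines longer than the requested width are not acceptable, one other approach worth trying is to glue each taboo word to the word before it, so that textwrap moves the pair to the next line together. The sketch below is not part of the code above. It assumes the taboo words are plain words, that the private-use character used as `GLUE` never occurs in the input, and that textwrap only breaks on the ASCII whitespace characters listed in its documentation. The name `wrap_glued` is made up for this example.

```python
import re
import textwrap

GLUE = '\ue000'  # Private-use character, assumed absent from the input text.


def wrap_glued(text, width=80, taboo=()):
    """Wrap text so that no taboo word starts a line (hypothetical helper)."""
    if not taboo:
        return textwrap.fill(text, width)
    # Longest alternatives first so a longer taboo word is tried first.
    words = sorted(taboo, key=len, reverse=True)
    alternation = '|'.join(re.escape(w) for w in words)
    # Match the whitespace run that separates a taboo word from the word
    # before it, and replace it with a character textwrap won't break on.
    pattern = re.compile(r'(?<=\S)\s+(?=(?:%s)\b)' % alternation)
    glued = pattern.sub(GLUE, text)
    # break_long_words=False keeps a glued pair whole even if it alone
    # is longer than width.
    wrapped = textwrap.fill(glued, width, break_long_words=False)
    return wrapped.replace(GLUE, ' ')
```

With the sample text above, `wrap_glued(sample_text, 40, taboo=taboo)` pushes "consectetur" down to the second line together with "adipiscing" instead of lengthening the first line. A line only exceeds the width when a glued pair is itself longer than the width.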