I have Python reading long lines, wrapping them if they exceed x characters, and writing them to a new file. I figured out how to ensure words are not split apart, but I have a more specific problem: I don't want certain words to ever appear at the start of a line. After hours of research, I have realized I have been going down the wrong path to fix this and need help.
Here's the code I have now:
with txtfile as infile, testfile as outfile:
    for line in infile:
        if len(line) > 80 and any(word in line[77:] for word in connectives):
            outfile.write(textwrap.fill(line,96,replace_whitespace=False))
        elif len(line) > 80 and not any(word in line[77:] for word in connectives):
            outfile.write(textwrap.fill(line,80,replace_whitespace=False))
        else:
            outfile.write(line)
A little explanation of what I tried to do: right now it reads a line of a couple hundred characters and, if it's more than 80 characters, wraps it to 80. I had thought that I would check whether the last few characters of the line contained any of the words I am targeting and, if so, lengthen the wrap for those lines so that the target word wouldn't get dropped to the next line. But I have realized that was faulty thinking (maybe moronic is better) on my part, because the if statement only checks the original line of several hundred characters. It doesn't check the subsequent lines that the wrapping produces. In the end, I can avoid breaking at the wrong word on the first line, but not on subsequent lines.
Since textwrap won't break up whole words if you don't want it to, I'm hoping there is also a way to tell it not to let certain words or characters drop to the next line.
Alternatively, perhaps there is a way to read what was wrapped and anytime a specific word appears as the first word on the line, then move it up to the end of the previous line.
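A minimal sketch of that post-processing idea, for illustration only: wrap with textwrap.wrap, then move any target word that starts a line up to the end of the previous line. The names wrap_then_pull_up and PUNCT are hypothetical, and the sketch assumes matching is case-sensitive and that trailing punctuation should be ignored when comparing words.

```python
import textwrap

PUNCT = '.,;:!?'  # assumption: punctuation to ignore when matching a word


def wrap_then_pull_up(text, width=80, taboo=()):
    """Wrap text, then move taboo words off the start of each line."""
    taboo = set(taboo)
    result = []
    for line in textwrap.wrap(text, width):
        words = line.split()
        # While the line starts with a taboo word and there is a previous
        # line, append that word to the previous line instead.
        while result and words and words[0].strip(PUNCT) in taboo:
            result[-1] += ' ' + words.pop(0)
        if words:  # skip lines emptied by the pull-up
            result.append(' '.join(words))
    return '\n'.join(result)


if __name__ == '__main__':
    demo = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
    print(wrap_then_pull_up(demo, 40, taboo=['adipiscing']))
```

Moving a word up can leave the previous line longer than the requested width, and a target word that is the very first word of the text has nowhere to go, so it stays where it is.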
You might be able to hack textwrap to do what you want; meanwhile, here's a snippet that does it. The basic word-wrapping code is an adaptation of the algorithm in a section of the Wikipedia article titled "Line wrap and word wrap".
When words are encountered that can't be at the beginning of the next line, they're just added to the current one (which technically makes it too long). If you find that unacceptable, at least this will provide you with a code-base for trying other approaches.
import re

def textsplitter(text):
    # Yield (word, suffix) pairs: the leading word characters of each token,
    # plus any trailing non-space characters such as punctuation.
    for match_obj in re.finditer(r'\w+\S*', text):
        match_str = match_obj.group()
        submatch_obj = re.match(r'(\w+)(\S*)', match_str)
        yield submatch_obj.groups()

def textwrapper(text, width=79, **kwargs):
    taboo = set(kwargs.get('taboo', []))  # Words that can't be first.
    result = []
    spaceleft = width
    for word, suffix in textsplitter(text):
        phrase = word + suffix  # Note suffix might be empty string ''.
        if word in taboo:  # Can't be first, so just add it.
            result.append(phrase)
            spaceleft = 0  # The next non-taboo word will start a new line.
        else:  # Add word, possibly with an inserted linebreak.
            if len(phrase) > spaceleft:
                result.append('\n'+phrase)  # Insert linebreak before word.
                spaceleft = width - len(phrase)
            else:
                result.append(phrase)
                spaceleft = spaceleft - (len(phrase) + 1)
    return ' '.join(result)

sample_text = """\
Lorem ipsum dolor sit amet, consectetur adipiscing elit. In molestie lectus
nulla, at aliquam dolor suscipit ac. Mauris vitae purus non est vehicula dictum.
Integer varius diam tellus, quis cursus lacus sollicitudin sed. Nulla eu quam
nec felis egestas tristique eu placerat est. Praesent tincidunt libero in
aliquet euismod. Pellentesque eu odio mollis, consequat eros in, vestibulum
mauris. Aenean gravida dolor et ligula cursus laoreet.
"""
print('Wrapped with no taboo words:\n')
print(textwrapper(sample_text, 40))
print('\n'*2)
taboo = ['adipiscing', 'aliquam'] # Not allowed to appear at start of lines.
print('Wrapped again with taboo words {}:\n'.format(taboo))
print(textwrapper(sample_text, 40, taboo=taboo))
Output:
Wrapped with no taboo words:
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. In molestie lectus
nulla, at aliquam dolor suscipit ac.
Mauris vitae purus non est vehicula
dictum. Integer varius diam tellus, quis
cursus lacus sollicitudin sed. Nulla eu
quam nec felis egestas tristique eu
placerat est. Praesent tincidunt libero
in aliquet euismod. Pellentesque eu odio
mollis, consequat eros in, vestibulum
mauris. Aenean gravida dolor et ligula
cursus laoreet.
Wrapped again with taboo words ['adipiscing', 'aliquam']:
Lorem ipsum dolor sit amet, consectetur adipiscing
elit. In molestie lectus nulla, at aliquam
dolor suscipit ac. Mauris vitae purus non
est vehicula dictum. Integer varius diam
tellus, quis cursus lacus sollicitudin
sed. Nulla eu quam nec felis egestas
tristique eu placerat est. Praesent
tincidunt libero in aliquet euismod.
Pellentesque eu odio mollis, consequat
eros in, vestibulum mauris. Aenean
gravida dolor et ligula cursus laoreet.
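As for hacking textwrap itself, one possible sketch (not a built-in feature of the module) is to glue each taboo word to the word before it with a placeholder character that textwrap does not treat as whitespace, wrap, and then turn the placeholder back into a space. The names wrap_glued and GLUE are illustrative, and the sketch assumes that '\x00' never occurs in the input and that the taboo entries are plain words.

```python
import re
import textwrap

GLUE = '\x00'  # assumption: this character never appears in the input


def wrap_glued(text, width=80, taboo=()):
    """Wrap text so that a taboo word stays on the line of the word before it."""
    if not taboo:
        return textwrap.fill(text, width)
    # Match the whitespace that immediately precedes a whole taboo word.
    alternatives = '|'.join(re.escape(word) for word in taboo)
    pattern = re.compile(r'\s+(?=(?:%s)\b)' % alternatives)
    glued = pattern.sub(GLUE, text)
    # break_long_words=False keeps textwrap from splitting a glued pair
    # that happens to be longer than the width.
    wrapped = textwrap.fill(glued, width, break_long_words=False)
    return wrapped.replace(GLUE, ' ')


if __name__ == '__main__':
    demo = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
    print(wrap_glued(demo, 40, taboo=['adipiscing']))
```

Unlike the snippet above, this moves the preceding word down together with the taboo word, so lines normally stay within the width; a line only overflows when a glued pair is itself longer than the width.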