python regex nltk tokenize

Match a hyphen followed by a newline character


import re
# Current attempt: removes the hyphen together with the newline that follows it
string = re.sub(r'-\n', '', string)

I want to tokenize the words of a text. The problem is that every word hyphenated at the end of a line is tokenized incorrectly, so I need to remove the hyphen before a newline character.
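
To illustrate the problem, here is a minimal sketch. It uses plain `str.split()` as a stand-in for a real tokenizer, which is an assumption, and the sample text is made up as well:

```python
# Hypothetical sample: the word "token" is hyphenated across a line break.
text = "each to-\nken counts"

# str.split() stands in for a real tokenizer here (assumption).
tokens = text.split()
print(tokens)  # ['each', 'to-', 'ken', 'counts']
```

The word at the end of the line comes out as two pieces, and the first piece still carries the hyphen.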

Thanks for your help!


Solution

  • Try using a lookahead to match the newline rather than making it part of the text that gets replaced, so that only the hyphen is removed and the newline is kept:

    string = re.sub(r'-(?=\n)', '', string)
    

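
As a rough comparison of the two patterns, the sketch below runs both on a made-up sample string. The variable names `text`, `joined` and `hyphen_only` are only illustrative:

```python
import re

# Hypothetical sample: "token" is split across a line break,
# while "hy-phen" contains a hyphen that is not at the end of a line.
text = "A to-\nken split across lines, plus a mid-line hy-phen."

# Pattern from the question: consumes the hyphen and the newline,
# so the two halves of the word are joined.
joined = re.sub(r'-\n', '', text)

# Pattern with a lookahead: the newline is only asserted, not consumed,
# so just the hyphen is removed and the line break stays.
hyphen_only = re.sub(r'-(?=\n)', '', text)

print(repr(joined))       # 'A token split across lines, plus a mid-line hy-phen.'
print(repr(hyphen_only))  # 'A to\nken split across lines, plus a mid-line hy-phen.'
```

Which one fits depends on what the tokenizer should see. If the word should be rejoined into a single token, the first pattern is likely the better choice. If the line break must be preserved, use the lookahead. Neither pattern touches hyphens in the middle of a line.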