
Tokenize text but keep compound hyphenated words together


I am trying to clean up text using a pre-processing function. I want to remove all non-alpha characters such as punctuation and digits, but I would like to retain compound words that use a dash without splitting them (e.g. pre-tender, pre-construction).

def preprocess(text):
  #remove punctuation
  text = re.sub('\b[A-Za-z]+(?:-+[A-Za-z]+)+\b', '-', text)
  text = re.sub('[^a-zA-Z]', ' ', text)
  text = text.split()
  text = " ".join(text)
  return text

For instance, the original text:

"Attended pre-tender meetings" 

should be split into

['attended', 'pre-tender', 'meeting'] 

rather than

['attended', 'pre', 'tender', 'meeting']

Any help would be appreciated!


Solution

  • To remove every non-letter character except a - that sits between two letters, you can use

    [\W\d_](?<![^\W\d_]-(?=[^\W\d_]))
    

    ASCII only equivalent:

    [^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))
    

    See the regex demo. Details:

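    • [\W\d_] - any character that is not a letter (a non-word character, a digit, or an underscore)
    • (?<![^\W\d_]-(?=[^\W\d_])) - a negative lookbehind that fails the match if the character just consumed is a - with a letter immediately before it and a letter immediately after it (the following letter is checked with the (?=[^\W\d_]) positive lookahead nested inside the lookbehind).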
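To make the details above concrete, here is a small sketch that runs both patterns on a few edge cases. The helper names `strip_unicode` and `strip_ascii` and the sample strings are made up for illustration and are not part of the original answer.

```python
import re

# Unicode-aware pattern from the answer: letters are [^\W\d_]
UNICODE_RE = re.compile(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))')
# ASCII-only equivalent from the answer: letters are [A-Za-z]
ASCII_RE = re.compile(r'[^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))')

def strip_unicode(text):
    # Hypothetical helper: replace matches with a space, then collapse whitespace
    return " ".join(UNICODE_RE.sub(" ", text).split())

def strip_ascii(text):
    # Hypothetical helper: same idea, but only A-Z and a-z count as letters
    return " ".join(ASCII_RE.sub(" ", text).split())

# A hyphen is kept only when there is a letter on both sides of it
print(strip_unicode("pre- tender"))   # hyphen followed by a space is removed
print(strip_unicode("covid-19"))      # hyphen followed by a digit is removed
print(strip_unicode("a--b"))          # neither hyphen has letters on both sides

# Accented letters survive the Unicode pattern but not the ASCII one
sample = "caf\u00e9-cr\u00e8me, 2 pre-tender"
print(strip_unicode(sample))
print(strip_ascii(sample))
```

The last two calls show the practical difference between the variants: the ASCII pattern treats accented letters as characters to remove, so it is probably only appropriate for plain English text.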

    See the Python demo:

    import re
    
    def preprocess(text):
      #remove all non-alpha characters but - between letters
      text = re.sub(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))', r' ', text)
      return " ".join(text.split())
    
    print(preprocess("Attended pre-tender, etc meetings."))
    # => Attended pre-tender etc meetings
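
The question asks for a list of tokens rather than a cleaned string. One possible alternative, sketched here as an assumption and not taken from the answer above, is to match the tokens directly with re.findall instead of removing everything else. The name `tokenize` and the lowercasing step are illustrative choices.

```python
import re

# One or more letters, optionally followed by "-letters" groups,
# so "pre-tender" stays one token while stray hyphens are dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def tokenize(text):
    # Hypothetical helper: lowercase first, then collect every token match
    return TOKEN_RE.findall(text.lower())

print(tokenize("Attended pre-tender meetings"))
# => ['attended', 'pre-tender', 'meetings']
```

Because the group is non-capturing, re.findall returns the whole match for each token. Calling `preprocess(text).split()` on the function above should give the same tokens without the lowercasing.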