I am trying to clean up text using a pre-processing function. I want to remove all non-alpha characters such as punctuation and digits, but I would like to retain compound words that use a dash without splitting them (e.g. pre-tender, pre-construction).
def preprocess(text):
    # remove punctuation
    text = re.sub('\b[A-Za-z]+(?:-+[A-Za-z]+)+\b', '-', text)
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.split()
    text = " ".join(text)
    return text
For instance, the original text:
"Attended pre-tender meetings"
should be split into
['attended', 'pre-tender', 'meeting']
rather than
['attended', 'pre', 'tender', 'meeting']
Any help would be appreciated!
To remove all non-alpha characters except a - between letters, you can use
[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))
ASCII-only equivalent:
[^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))
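The two patterns only differ on letters outside A-Z and a-z. A minimal sketch of that difference, assuming Python's re module; the sample string and the strip_non_letters helper are made up for illustration:

```python
import re

# Unicode-aware pattern: [^\W\d_] matches any Unicode letter.
UNICODE_PATTERN = re.compile(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))')
# ASCII-only pattern: only A-Z and a-z count as letters.
ASCII_PATTERN = re.compile(r'[^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))')

def strip_non_letters(text, pattern):
    # Hypothetical helper: replace every match with a space,
    # then collapse runs of whitespace.
    return " ".join(pattern.sub(" ", text).split())

sample = "caf\u00e9 pre-tender 2nd"  # "café pre-tender 2nd"
print(strip_non_letters(sample, UNICODE_PATTERN))
# => café pre-tender nd
print(strip_non_letters(sample, ASCII_PATTERN))
# => caf pre-tender nd
```

The accented letter survives the first pattern but is treated as a non-letter by the second, so pick the variant that matches the text you expect.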
See the regex demo. Details of the pattern:
[\W\d_] - any character that is not a letter (a non-word character, a digit or an underscore)
(?<![^\W\d_]-(?=[^\W\d_])) - a negative lookbehind that fails the match if there is a letter and a - immediately to the left, and right after that -, there is a letter (checked with the (?=[^\W\d_]) positive lookahead).
See the Python demo:
import re

def preprocess(text):
    # remove all non-alpha characters but - between letters
    text = re.sub(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))', r' ', text)
    return " ".join(text.split())

print(preprocess("Attended pre-tender, etc meetings."))
# => Attended pre-tender etc meetings
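If a lowercased list of tokens is wanted, as in the question, one alternative is to match the words directly instead of removing everything else. This is a different technique from the substitution above, shown only as a rough sketch; WORD_PATTERN and tokenize are hypothetical names:

```python
import re

# Hypothetical pattern, not taken from the answer above: a run of letters,
# optionally followed by groups of one hyphen plus another run of letters.
WORD_PATTERN = re.compile(r'[^\W\d_]+(?:-[^\W\d_]+)*')

def tokenize(text):
    # Lowercase first, then collect every word or hyphenated compound.
    return WORD_PATTERN.findall(text.lower())

print(tokenize("Attended pre-tender meetings"))
# => ['attended', 'pre-tender', 'meetings']
```

Note that this keeps "meetings" as written; reducing it to "meeting" would need a separate stemming or lemmatization step, which is not shown here.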