Here's what I have so far:
import re

def read_file(file):
    words = []
    for line in file:
        for word in line.split():
            # Lowercase the word, then drop every character that is not a-z
            words.append(re.sub("[^a-z]", "", word.lower()))
    return words
As it stands, this reads "can't" as "cant" and "co-ordinate" as "coordinate". I want to read in the words so that these 2 punctuation marks (the apostrophe and the hyphen) are kept. How do I modify my code to do this?
There are two approaches. One is suggested by ritesht93 in a comment on the question, though I'd use
words.append(re.sub("[^-'a-z]+", "", word.lower()))
                       ^^    ^ - One or more occurrences to remove in one go
                       |       - Apostrophe and hyphen added
The + quantifier removes each run of unwanted characters matching the pattern in one go.
Note that the hyphen is placed at the start of the negated character class (right after the ^) and thus does not have to be escaped. NOTE: It is still recommended to escape it (\-) if other, less regex-savvy developers are going to maintain this later.
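For reference, here is a minimal sketch of the first approach as a complete function. The names read_words and UNWANTED are made up for illustration, and skipping tokens that end up empty is an assumption on my part, not something the question asked for.

```python
import re

# Anything that is not a hyphen, an apostrophe, or a lowercase ASCII letter.
# (UNWANTED is a hypothetical name used only in this sketch.)
UNWANTED = re.compile(r"[^-'a-z]+")

def read_words(lines):
    """Return cleaned, lowercased words from an iterable of text lines."""
    words = []
    for line in lines:
        for word in line.split():
            cleaned = UNWANTED.sub("", word.lower())
            if cleaned:  # assumption: drop tokens that were only digits/punctuation
                words.append(cleaned)
    return words

print(read_words(["I can't co-ordinate this, can I?", "Done: 42!"]))
# ['i', "can't", 'co-ordinate', 'this', 'can', 'i', 'done']
```

Because hyphens and apostrophes are now allowed, a standalone token such as "--" would survive as is; strip those separately if they matter for your data.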
The second approach will be helpful if you have Unicode letters.
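ur"((?![-'])[\W\d_])+"
See the regex demo (to be compiled with the re.UNICODE flag). In Python 3, drop the u prefix: ur"..." is not valid syntax there, and str patterns are Unicode-aware by default, so the flag is redundant.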
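The pattern matches one or more characters that are non-word characters, digits, or underscores ([\W\d_]), that is, anything that is not a letter. The negative lookahead (?![-']) excludes the hyphen and the apostrophe, so those two are left in place.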
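Here is a rough sketch of the second approach for Python 3, where str patterns already use Unicode matching. The names NON_LETTERS and clean_word are hypothetical; the pattern is written in double quotes so that the apostrophe inside it does not end the string literal, and I swapped the capturing group for a non-capturing one since the group is never used.

```python
import re

# One or more characters that are non-word, digit, or underscore,
# unless the character is a hyphen or an apostrophe (negative lookahead).
# NON_LETTERS is a hypothetical name used only in this sketch.
NON_LETTERS = re.compile(r"(?:(?![-'])[\W\d_])+")

def clean_word(word):
    """Lowercase a word and strip everything except letters, '-' and "'"."""
    return NON_LETTERS.sub("", word.lower())

print(clean_word("Co-ordinate!"))  # co-ordinate
print(clean_word("can't,"))        # can't
print(clean_word("naïve_42"))      # naïve
```

Unlike the [^-'a-z]+ version, this keeps accented and other non-ASCII letters, since \W is defined in terms of Unicode word characters.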