Using re to sanitize a word file, allowing letters with hyphens and apostrophes

Here's what I have so far:

import re

def read_file(file):
    words = []
    for line in file:
        for word in line.split():
            words.append(re.sub("[^a-z]", "", word.lower()))
    return words

As it stands, this reads "can't" as "cant" and "co-ordinate" as "coordinate". I want to read in the words so that hyphens and apostrophes are kept. How do I modify my code to do this?

Solution

There are two approaches. The first was suggested by ritesht93 in a comment on the question, though I'd write it as

words.append(re.sub("[^-'a-z]+", "", word.lower()))
                       ^^    ^ - One or more occurrences to remove in one go
                        | - Apostrophe and hyphen added

The + quantifier matches each run of consecutive unwanted characters as a single match, so a whole run is removed in one substitution rather than one character at a time.

Note that the hyphen is placed at the start of the negated character class (right after the ^), where it is treated as a literal and does not have to be escaped. NOTE: It is still recommended to escape it (\-) if other, less regex-savvy developers are going to maintain this later.
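A minimal sketch of the first approach, assuming the input is any iterable of lines. The helper name `sanitize_words`, the sample lines, and the choice to skip tokens that end up empty are my own additions for illustration, not part of the question's code:

```python
import re

# Anything that is NOT a hyphen, an apostrophe, or a lowercase ASCII letter.
# The hyphen sits right after ^, so it is a literal, not a range operator.
UNWANTED = re.compile(r"[^-'a-z]+")

def sanitize_words(lines):
    """Hypothetical helper: lowercase each word and strip unwanted characters."""
    words = []
    for line in lines:
        for word in line.split():
            cleaned = UNWANTED.sub("", word.lower())
            if cleaned:  # assumption: drop tokens that were only digits/punctuation
                words.append(cleaned)
    return words

sample = ["I can't co-ordinate this, sorry!", "Item 42 done."]
print(sanitize_words(sample))
```

Because hyphens and apostrophes are allowed anywhere, a token made up only of those characters (for example "--") would pass through unchanged; filter such tokens separately if that matters for your data.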

The second approach will be helpful if you have Unicode letters.

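r"((?![-'])[\W\d_])+"

See the regex demo. The pattern is written in double quotes because it contains an apostrophe. In Python 2, make it a unicode pattern (ur"...") and compile it with the re.UNICODE flag; in Python 3, str patterns are Unicode-aware by default, so the flag is not needed.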

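The pattern matches one or more characters that are non-word characters, digits, or underscores ([\W\d_]), which in practice means anything that is not a letter. The negative lookahead (?![-']) stops the match at a hyphen or an apostrophe, so those two characters are left in place.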
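As a rough illustration of the Unicode-aware approach, here is a sketch for Python 3, where no flag is required. The helper name `sanitize_unicode` and the sample words are hypothetical, and a non-capturing group (?:...) is used in place of the capturing one since the group is never read:

```python
import re

# (?![-']) refuses to match at a hyphen or an apostrophe; [\W\d_] then matches
# a non-word character, a digit, or an underscore. Python 3 str patterns are
# Unicode-aware by default, so accented letters count as word characters.
NON_LETTERS = re.compile(r"(?:(?![-'])[\W\d_])+")

def sanitize_unicode(word):
    """Hypothetical helper: keep letters, hyphens and apostrophes."""
    return NON_LETTERS.sub("", word.lower())

# \u00e9 is e-acute, \u00ef is i-diaeresis (written as escapes to stay ASCII-safe).
for raw in ["Caf\u00e9!", "co-ordinate,", "can't?", "na\u00efve_42"]:
    print(sanitize_unicode(raw))
```

Note that only the ASCII apostrophe (') is preserved here; a typographic apostrophe such as U+2019 is a non-word character and would be removed unless it is added to the lookahead.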