Tags: python, string, punctuation

Stripping punctuation from unique strings in an input file


A related question, "Best way to strip punctuation from a string in Python", deals with stripping punctuation from an individual string. I want to read text from an input file and print only one copy of each string, with any trailing punctuation removed. I have started with something like this:

f = open('#file name ...', 'r')  # 'r' reads from the start; with 'a+' the position may begin at the end of the file
for x in set(f.read().split()):
    print x

But the problem is that if the input file has, for instance, this line:

This is not is, clearly is: weird

It treats the three occurrences of "is" ("is", "is," and "is:") as different strings, but I want to ignore the punctuation and print "is" only once rather than three times. How do I remove any trailing punctuation before putting each string in the set?

Thanks for any help. (I am really new to Python.)


Solution

  • import re
    
    for x in set(re.findall(r'\b\w+\b', f.read())):
        print x

    This loop should separate words from the surrounding punctuation more reliably than split().

    This regular expression matches contiguous runs of word characters (a-z, A-Z, 0-9 and _).

    If you want to find letters only (no digits and no underscore), then replace the \w with [a-zA-Z].

    >>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
    ['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
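
Putting the pieces together, here is a minimal sketch of the whole task, assuming the goal is simply the set of distinct words in the text. The names `unique_words`, `unique_words_in_file` and `letters_only` are hypothetical and not part of the original answer; the two patterns are the ones described above. It uses `print()` with a single argument so that it runs on Python 2 and Python 3 alike.

```python
import re


def unique_words(text, letters_only=False):
    """Return the set of distinct words in text, ignoring punctuation.

    letters_only is a hypothetical switch: when True, the [a-zA-Z]
    variant is used, so only tokens made up entirely of letters match.
    """
    pattern = r'\b[a-zA-Z]+\b' if letters_only else r'\b\w+\b'
    return set(re.findall(pattern, text))


def unique_words_in_file(path, letters_only=False):
    """Read the file at path (a placeholder argument) and return its distinct words."""
    with open(path) as f:
        return unique_words(f.read(), letters_only)


sample = "This is not is, clearly is: weird"
for word in sorted(unique_words(sample)):
    print(word)
```

Note that "This" and "is" remain separate entries because they are different words. With `letters_only=True`, the `\b` anchors mean that a token such as `abc123` is skipped entirely rather than yielding `abc`.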