Tags: python, csv, python-3.x, punctuation

Remove punctuation and create .csv file with list of words, flagged with whether punctuation existed


This is what I have so far:

import re
import csv

outfile1 = open('test_output.csv', 'wt')
outfileWriter1 = csv.writer(outfile1, delimiter=',')

rawtext = open('rawtext.txt', 'r').read()
print(rawtext)

rawtext = rawtext.lower()
print(rawtext)

re.sub('[^A-Za-z0-9]+', '', rawtext)
print(rawtext)

First of all, when I run this, the punctuation doesn't get removed, so I'm wondering whether there's something wrong with my expression.

Secondly, I'm trying to produce a .csv list of all words flagged with whether they had punctuation or not, e.g. a text file reading "Hello! It's a nice day." would output:

ID, PUNCTUATION, WORD
1,  Y,           hello
2,  Y,           its
3,  N,           a
4,  N,           nice
5,  Y,           day

I know I can use .split() to split up the words but other than that I have no idea how to go about this! Any help would be appreciated.


Solution

  • You can do something like this:

    from string import punctuation
    import csv
    
    strs = "Hello! It's a nice day."
    
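    # newline='' stops the csv module from writing blank rows on Windows
    with open('abc.csv', 'w', newline='') as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerow(['ID', 'PUNCTUATION', 'WORD'])
        # map every punctuation code point to None so str.translate deletes it
        table = dict.fromkeys(map(ord, punctuation))
        # enumerate from 1 gives the ID along with each word
        for i, word in enumerate(strs.split(), 1):
            # str.translate is typically faster than a regex for this
            new_strs = word.translate(table)
            # flag 'Y' when stripping punctuation changed the word
            punc = 'Y' if new_strs != word else 'N'
            # lowercase the word to match the requested output
            writer.writerow([i, punc, new_strs.lower()])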
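
As for the first part of the question: the expression itself matches punctuation fine. The call has no visible effect because Python strings are immutable, so re.sub returns a new string and the result has to be assigned to something. Note also that the pattern as written treats spaces as "not alphanumeric", so it would glue all the words together. Here is a small sketch using the sample sentence from the question (the names unchanged, no_spaces and cleaned are only for illustration, and adding \s to the character class is one possible adjustment, not the only one):

```python
import re

rawtext = "Hello! It's a nice day."

# re.sub returns a new string; calling it without assignment changes nothing
re.sub(r'[^A-Za-z0-9]+', '', rawtext)
unchanged = rawtext

# assigning the result keeps it, but this pattern also deletes the spaces
no_spaces = re.sub(r'[^A-Za-z0-9]+', '', rawtext)

# allowing whitespace in the character class keeps the words separate
cleaned = re.sub(r'[^A-Za-z0-9\s]+', '', rawtext.lower())
print(cleaned)
```

Running re.sub on the whole text like this loses the information about which words had punctuation, which is why the answer above works word by word instead.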