Quick question: I'm using string and NLTK's stopwords to strip a block of text of all its punctuation and stopwords as part of pre-processing before feeding it into some natural language processing algorithms. I've tested each component separately on a couple of blocks of raw text while getting used to this process, and everything seemed fine.
import string
from nltk.corpus import stopwords

def text_process(text):
    """
    Takes in a string of text and does the following:
    1. Removes punctuation.
    2. Removes stopwords.
    3. Returns a list of cleaned, tokenized text.
    """
    nopunc = [char for char in text.lower() if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word not in stopwords.words('english')]
However, when I apply this function to the text column of my dataframe – it's text from a bunch of Pitchfork reviews – I can see that the punctuation isn't actually being removed, although the stopwords are.
Unprocessed:
pitchfork['content'].head(5)
0 “Trip-hop” eventually became a ’90s punchline,...
1 Eight years, five albums, and two EPs in, the ...
2 Minneapolis’ Uranium Club seem to revel in bei...
3 Minneapolis’ Uranium Club seem to revel in bei...
4 Kleenex began with a crash. It transpired one ...
Name: content, dtype: object
Processed:
pitchfork['content'].head(5).apply(text_process)
0 [“triphop”, eventually, became, ’90s, punchlin...
1 [eight, years, five, albums, two, eps, new, yo...
2 [minneapolis’, uranium, club, seem, revel, agg...
3 [minneapolis’, uranium, club, seem, revel, agg...
4 [kleenex, began, crash, it, transpired, one, n...
Name: content, dtype: object
Any thoughts on what's going wrong here? I've looked through the documentation, and I haven't seen anyone who's struggling with this problem in the exact same manner, so I'd love some insight on how to tackle this. Thanks so much!
The problem here is that Unicode has separate characters for the left and right quotation marks (single and double), and none of them appear in string.punctuation, which contains only the plain ASCII quotes.
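You can verify this quickly (Python 3 shown here):

```python
import string

# The curly quotes are distinct Unicode characters, not the ASCII
# quotes '"' and "'" that string.punctuation contains.
for quote in '\u201c\u201d\u2018\u2019':  # “ ” ‘ ’
    print(quote, quote in string.punctuation)  # each prints False
```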
I would do something like this:

punctuation = [c for c in string.punctuation] + [u'\u201c', u'\u201d', u'\u2018', u'\u2019']
nopunc = [char for char in text.decode('utf-8').lower() if char not in punctuation]

This adds the Unicode code points for the non-ASCII quotation marks to a list called punctuation, decodes the raw text to Unicode, and then filters those characters out as well.

Note: this is Python 2. If you're using Python 3, strings are already Unicode, so drop the .decode('utf-8') call (and the u'' prefixes are optional).
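For reference, here's a Python 3 sketch of the same fix (the function name strip_punct is mine, just for illustration):

```python
import string

# Extend the ASCII punctuation with the Unicode curly quotes “ ” ‘ ’
PUNCT = set(string.punctuation) | {'\u201c', '\u201d', '\u2018', '\u2019'}

def strip_punct(text):
    # Python 3 strings are already Unicode, so no .decode() step is needed
    return ''.join(char for char in text.lower() if char not in PUNCT)

print(strip_punct('\u201cTrip-hop\u201d eventually became a \u201990s punchline'))
# triphop eventually became a 90s punchline
```

Using a set for PUNCT also makes the per-character membership test O(1) instead of scanning a list.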