While pre-processing text by removing special characters and then removing stop words, words such as "add-on" and "non-committal" get converted to "add" and "committal" respectively. What is the best approach to handle these cases?
The "best" approach depends on the intended application and on how you want to handle the context and meaning of words. Generally, a hyphenated word has a distinct meaning that is lost if any part of it is removed. For example, "add-on" is a noun, while "add" is a verb. Similarly, "committal" and "non-committal" have opposite meanings. Note that most stop word lists do not include "non" as a stop word.
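To see where the pieces go missing, here is a minimal sketch of the kind of naive pipeline the question describes. The function name `naive_clean` and the tiny `STOP_WORDS` set are hypothetical and only for illustration; a real stop word list would be much larger.

```python
import re

# Hypothetical, deliberately tiny stop word set (illustration only)
STOP_WORDS = {"the", "was", "on"}

def naive_clean(text):
    """Replace every non-letter with a space, then drop stop words."""
    # The hyphen is treated like any other special character,
    # so "add-on" is split into "add" and "on" at this point.
    no_special = re.sub(r"[^A-Za-z\s]", " ", text.lower())
    # "on" is a stop word, so only "add" survives the filter.
    return [w for w in no_special.split() if w not in STOP_WORDS]

print(naive_clean("The add-on was appreciated"))
```

Because the hyphen is stripped before stop word removal, the two halves are filtered independently, which is why the word needs to be kept together until the very end.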
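The following solution assumes that you'd like to treat hyphenated words as a whole rather than as individual parts, while still removing non-alpha characters and stop words. This is done by:
1. expanding contractions,
2. tokenizing the text,
3. removing stop words,
4. removing tokens that contain no alpha characters, and
5. collapsing hyphenated words into a single token.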
The last step also handles cases where the original text omits the hyphen between "non" and the subsequent word (e.g. "non starter"). Additionally, I've included the option to keep numbers if you desire. Just uncomment the parts of the code marked with # to include nums.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import contractions

# Requires the NLTK stop word and tokenizer data, e.g. nltk.download("stopwords")

text = "The $1,050 add-on was appreciated! It saved me some $$$. However, \
he seemed non-committal about the whole situation -- \
something which didn't sit right with me. Should it? For some it's a non starter."

my_stopwords = stopwords.words("english")  # Create stop words to remove
expanded_text = contractions.fix(text)  # Expand contractions
tokens = word_tokenize(expanded_text)  # Tokenize text
filtered_tokens = [w.lower() for w in tokens if w.lower() not in my_stopwords]  # Remove stop words

# Return True if char is an ASCII letter (A-Z or a-z)
def allowable_char(char):
    return (65 <= ord(char) <= 90) or (97 <= ord(char) <= 122)  # or (48 <= ord(char) <= 57)  # to include nums

# Return a list of booleans marking which chars in string are allowable
def al_num_ords(string):
    return [allowable_char(c) for c in string]

# Remove tokens that contain no allowable characters
only_al_num_tokens = [tok for tok in filtered_tokens if any(al_num_ords(tok))]

# Collapse hyphenated words & handle occurrences of "non" without hyphenation of subsequent word
processed_text = []
found_unhyphenated = False
for i, tok in enumerate(only_al_num_tokens):
    if found_unhyphenated:
        # This token was already merged into the preceding "non"; skip it
        found_unhyphenated = False
    elif tok == "non" and i + 1 < len(only_al_num_tokens):
        processed_text.append(tok + only_al_num_tokens[i + 1])
        found_unhyphenated = True
    else:
        processed_text.append("".join(tok.split("-")))
        # processed_text.append("".join(tok.replace(",", "-").split("-")))  # to include nums

print(processed_text)
Output with alpha characters only:
['addon', 'appreciated', 'saved', 'however', 'seemed', 'noncommittal', 'whole', 'situation', 'something', 'sit', 'right', 'nonstarter']
Output with alphanumerical characters only:
['1050', 'addon', 'appreciated', 'saved', 'however', 'seemed', 'noncommittal', 'whole', 'situation', 'something', 'sit', 'right', 'nonstarter']
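If you would rather avoid the extra dependencies, the same idea can be sketched with only the standard library. This is a rough variant under stated assumptions, not a drop-in replacement: the function name `preprocess` and the small `STOP_WORDS` set are hypothetical, a regular expression stands in for `word_tokenize`, contraction expansion is omitted, and only the alpha-only mode is covered.

```python
import re

# Hypothetical, deliberately small stop word set (illustration only);
# the code above uses NLTK's English list instead.
STOP_WORDS = {"the", "was", "he", "for", "some", "it", "is", "a"}

# A token is a run of ASCII letters, optionally continued by "-letters",
# so "add-on" stays one token while "$", "--" and digits are never matched.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

def preprocess(text):
    """Return lowercase tokens with hyphenated words collapsed."""
    tokens = [t.lower() for t in TOKEN_PATTERN.findall(text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]

    result = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Merge a stand-alone "non" with the following token, if there is one
        if tok == "non" and i + 1 < len(tokens):
            tok += tokens[i + 1]
            i += 1
        # Collapse the hyphen only now, after stop word removal
        result.append(tok.replace("-", ""))
        i += 1
    return result

sample = ("The $1,050 add-on was appreciated! He seemed non-committal. "
          "For some it is a non starter.")
print(preprocess(sample))
# ['addon', 'appreciated', 'seemed', 'noncommittal', 'nonstarter']
```

Keeping the hyphen inside the token until after stop word removal is what prevents "on" in "add-on" from being filtered out on its own.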