Handling stop words that are part of hyphenated words while preprocessing text

While pre-processing text by removal of special characters followed by removal of stop words, words such as add-on and non-committal get converted to add and committal respectively. What is the best approach to handle these cases?

Solution

The "best" approach depends on the intended application and on how you want to handle the context and meaning of words. Generally, a hyphenated word has a distinct meaning that is lost if any part is removed. For example, "add-on" is a noun, while "add" is a verb. Similarly, "committal" and "non-committal" have opposite meanings. Note that most stop word lists do not include "non".
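To see why the parts get lost, here is a minimal sketch of the naive pipeline described in the question, next to a hyphen-aware variant. The tiny stop word set and both regular expressions are illustrative assumptions, not the NLTK list or the asker's actual code.

```python
import re

# Hypothetical, deliberately tiny stop word set for illustration only
STOP_WORDS = {"the", "was", "on", "a"}


def naive_preprocess(text):
    # Replacing every non-letter with a space splits "add-on" into "add" and "on",
    # and "on" is then dropped as a stop word
    cleaned = re.sub(r"[^a-z]+", " ", text.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def hyphen_aware_preprocess(text):
    # Keep hyphens inside words so "add-on" survives as a single token
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [w for w in tokens if w not in STOP_WORDS]


print(naive_preprocess("The add-on was a non-starter"))
print(hyphen_aware_preprocess("The add-on was a non-starter"))
```

The first call loses the "on" of "add-on", while the second keeps both hyphenated words intact because stop word matching only ever sees the whole token.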

The following solution assumes that you want to treat hyphenated words as a whole rather than as individual parts, while still removing non-alpha characters and stop words. It does this by:

expanding contractions,
removing stop words,
removing non-alpha characters, and then
collapsing hyphenated words.

The last step also handles cases where the original text omits the hyphen between "non" and the following word (e.g. "non starter"). Additionally, I've included the option to keep numbers. Just uncomment the parts of the code marked with # to include nums.
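As an alternative to commenting lines in and out, the character check could take a flag. This is only a sketch: the names `is_allowed`, `keep_token`, and `keep_digits` are hypothetical and are not used in the solution code.

```python
def is_allowed(char, keep_digits=False):
    # ASCII letters are always allowed; ASCII digits only when requested
    if ("a" <= char <= "z") or ("A" <= char <= "Z"):
        return True
    return keep_digits and ("0" <= char <= "9")


def keep_token(token, keep_digits=False):
    # A token is kept if at least one of its characters is allowed
    return any(is_allowed(c, keep_digits) for c in token)


tokens = ["$", "1,050", "add-on", "--", "saved"]
print([t for t in tokens if keep_token(t)])
print([t for t in tokens if keep_token(t, keep_digits=True)])
```

With the flag off, "1,050" is dropped along with the pure punctuation tokens; with it on, "1,050" is kept and can be cleaned up in the collapsing step.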

Code

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import contractions

text = "The $1,050 add-on was appreciated! It saved me some $$$. However, \
he seemed non-committal about the whole situation -- \
something which didn't sit right with me. Should it? For some it's a non starter."

# The NLTK stop word and tokenizer data must be downloaded first (see nltk.download)
my_stopwords = set(stopwords.words("english"))  # Stop words to remove (set for fast lookup)
expanded_text = contractions.fix(text)  # Expand contractions
tokens = word_tokenize(expanded_text)  # Tokenize text
filtered_tokens = [w.lower() for w in tokens if w.lower() not in my_stopwords]  # Remove stop words


# Function returns True if char is an ASCII letter (A-Z or a-z)
def allowable_char(char):
    return (65 <= ord(char) <= 90) or (97 <= ord(char) <= 122)  # or (48 <= ord(char) <= 57)  # to include nums


# Function returns boolean array corresponding to allowable chars in string
def al_num_ords(string):
    return [allowable_char(c) for c in string]


# Remove tokens that contain only non alpha characters
only_al_num_tokens = [tok for tok in filtered_tokens if any(al_num_ords(tok))]

# Collapse hyphenated words & handle occurrences of "non" without hyphenation of subsequent word
processed_text = []
skip_next = False
for i, tok in enumerate(only_al_num_tokens):
    if skip_next:
        # This token was already merged into the preceding "non"
        skip_next = False
    elif tok == "non" and i + 1 < len(only_al_num_tokens):
        processed_text.append(tok + only_al_num_tokens[i + 1])
        skip_next = True
    else:
        processed_text.append(tok.replace("-", ""))
        # processed_text.append(tok.replace(",", "").replace("-", ""))  # to include nums

print(processed_text)

Output

Alpha characters only

['addon', 'appreciated', 'saved', 'however', 'seemed', 'noncommittal', 'whole', 'situation', 'something', 'sit', 'right', 'nonstarter']

Alphanumerical characters only

['1050', 'addon', 'appreciated', 'saved', 'however', 'seemed', 'noncommittal', 'whole', 'situation', 'something', 'sit', 'right', 'nonstarter']
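
To check the collapsing step on inputs where "non" appears mid-sentence or as the very last token, it can be wrapped in a function. This is a sketch: `collapse_hyphens` is a hypothetical name, and leaving a trailing "non" unchanged is an assumption about the desired behavior.

```python
def collapse_hyphens(tokens):
    result = []
    skip_next = False
    for i, tok in enumerate(tokens):
        if skip_next:
            # Already merged into the preceding "non"
            skip_next = False
        elif tok == "non" and i + 1 < len(tokens):
            # Join an unhyphenated "non" with the word that follows it
            result.append(tok + tokens[i + 1].replace("-", ""))
            skip_next = True
        else:
            result.append(tok.replace("-", ""))
    return result


print(collapse_hyphens(["non", "starter", "add-on", "seemed", "non-committal", "non"]))
```

Tokens that follow a merged "non starter" pair are still processed, and a "non" with nothing after it does not raise an IndexError.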