Tags: python, nlp, spacy, stop-words

Handling stop words that are part of hyphenated words while preprocessing text


When I pre-process text by removing special characters and then removing stop words, words such as "add-on" and "non-committal" get reduced to "add" and "committal" respectively. What is the best approach to handle these cases?


Solution

  • The "best" approach depends on the intended application and on how you want to handle the context and meaning of words. Generally, a hyphenated word has a distinct meaning that is lost if any part of it is removed. For example, "add-on" is a noun, while "add" is a verb. Similarly, "committal" and "non-committal" have opposite meanings. Note that most stop word lists do not include "non" as a stop word.
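To make the problem concrete, here is a minimal sketch of the naive order of operations (special characters stripped first, stop words removed second). The tiny `STOP_WORDS` set and the `naive_preprocess` helper are illustrative assumptions, not part of any library:

```python
import re

# Tiny illustrative stop word set (an assumption; real lists are much longer)
STOP_WORDS = {"the", "was", "a", "an", "on", "about", "he"}


def naive_preprocess(text):
    """Strip special characters first, then drop stop words."""
    # Every character that is not a-z or whitespace becomes a space,
    # so the hyphen in "add-on" splits the word in two
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


print(naive_preprocess("The add-on was non-committal"))
# ['add', 'non', 'committal']
```

Once the hyphen is gone, "on" is indistinguishable from the ordinary stop word and is dropped, leaving only "add". "non" survives here only because it is not in the stop word set.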

    The following solution makes the assumption that you'd like to treat hyphenated words as a whole and not individual parts, yet still remove non-alpha characters and stop words. This is done by:

    1. expanding contractions,
    2. removing stop words,
    3. removing non-alpha characters, and then
    4. collapsing hyphenated words.

    The last step also handles cases where the original text omits the hyphen between "non" and the subsequent word (e.g. "non starter"). Additionally, I've included the option to keep numbers if you want them. Just uncomment the parts of the code marked with # to include nums.
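As an alternative to uncommenting code, the choice between alpha-only and alphanumeric filtering could be passed as a parameter. The following is a minimal sketch under that assumption; the `keep_digits` flag and the `keep_token` helper are hypothetical names and are not part of the solution code:

```python
def allowable_char(char, keep_digits=False):
    """Return True for ASCII letters, and for ASCII digits when keep_digits is set."""
    is_letter = ("A" <= char <= "Z") or ("a" <= char <= "z")
    is_digit = "0" <= char <= "9"
    return is_letter or (keep_digits and is_digit)


def keep_token(token, keep_digits=False):
    """Keep a token if at least one of its characters is allowed."""
    return any(allowable_char(c, keep_digits) for c in token)


tokens = ["$", "1,050", "add-on", "--", "saved"]
print([t for t in tokens if keep_token(t)])
# ['add-on', 'saved']
print([t for t in tokens if keep_token(t, keep_digits=True)])
# ['1,050', 'add-on', 'saved']
```

Note that this only decides which tokens survive; separators inside a kept token (the comma in "1,050", the hyphen in "add-on") still have to be removed in the collapsing step.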

    Code

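    # Requires the third-party packages nltk and contractions, plus the NLTK
    # "stopwords" corpus and tokenizer data (fetched with nltk.download)
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    import contractions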
    
    text = "The $1,050 add-on was appreciated! It saved me some $$$. However, \
    he seemed non-committal about the whole situation -- \
    something which didn't sit right with me. Should it? For some it's a non starter."
    
    my_stopwords = stopwords.words("english")  # Create stop words to remove
    expanded_text = contractions.fix(text)  # Expand contractions
    tokens = word_tokenize(expanded_text)  # Tokenize text
    filtered_tokens = [w.lower() for w in tokens if w.lower() not in my_stopwords]  # Remove stop words
    
    
    # Return True if char is an ASCII letter (A-Z is 65-90, a-z is 97-122)
    def allowable_char(char):
        return (65 <= ord(char) <= 90) or (97 <= ord(char) <= 122)  # or (48 <= ord(char) <= 57)  # to include nums
    
    
    # Function returns boolean array corresponding to allowable chars in string
    def al_num_ords(string):
        return [allowable_char(c) for c in string]
    
    
    # Drop tokens that contain no allowable characters (pure punctuation, symbols and, by default, numbers)
    only_al_num_tokens = [tok for tok in filtered_tokens if any(al_num_ords(tok))]
    
    # Collapse hyphenated words and merge a standalone "non" with the word that follows it
    processed_text = []
    skip_next = False
    for i, tok in enumerate(only_al_num_tokens):
        if skip_next:
            # This token was already merged into the preceding "non"
            skip_next = False
        elif tok == "non" and i + 1 < len(only_al_num_tokens):
            processed_text.append(tok + only_al_num_tokens[i + 1])
            skip_next = True
        else:
            processed_text.append(tok.replace("-", ""))
            # processed_text.append(tok.replace(",", "").replace("-", ""))  # to include nums
    
    print(processed_text)
    

    Output

    Alpha characters only

    ['addon', 'appreciated', 'saved', 'however', 'seemed', 'noncommittal', 'whole', 'situation', 'something', 'sit', 'right', 'nonstarter']
    

    Alphanumerical characters only

    ['1050', 'addon', 'appreciated', 'saved', 'however', 'seemed', 'noncommittal', 'whole', 'situation', 'something', 'sit', 'right', 'nonstarter']
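
The collapsing step can also be isolated into a small function so it can be tested without NLTK. This is a minimal sketch, not the code above verbatim: `collapse_tokens` and its `prefixes` parameter are hypothetical names, and the default of only "non" is an assumption carried over from the example text.

```python
def collapse_tokens(tokens, prefixes=("non",)):
    """Remove hyphens inside tokens and glue a standalone prefix to the next token."""
    merged = []
    skip_next = False
    for i, tok in enumerate(tokens):
        if skip_next:
            # This token was already consumed by the preceding prefix
            skip_next = False
        elif tok in prefixes and i + 1 < len(tokens):
            merged.append(tok + tokens[i + 1].replace("-", ""))
            skip_next = True
        else:
            merged.append(tok.replace("-", ""))
    return merged


print(collapse_tokens(["add-on", "non", "starter", "idea", "non"]))
# ['addon', 'nonstarter', 'idea', 'non']
```

Tokens that follow a merged pair are kept, and a trailing "non" with nothing after it is passed through unchanged instead of raising an IndexError.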