
Removing words in text files containing a character or string of letters with Python


I have a few lines of text and want to remove every word that contains a special character or a given fixed string (in Python).

Example:

in_lines = ['this is go:od', 
            'that example is bad', 
            'amp is a word']

# remove any word containing 'amp' or ':'
out_lines = ['this is', 
             'that is bad', 
             'is a word']

I know how to remove words that appear in a given list, but not how to remove words that contain a special character or a particular sequence of letters. Please let me know if anything is unclear and I'll add more information.

This is what I have for removing selected words:

def remove_stop_words(lines):
    stop_words = ['am', 'is', 'are']
    results = []
    for text in lines:
        tmp = text.split(' ')
        for stop_word in stop_words:
            for x in range(0, len(tmp)):
                if tmp[x] == stop_word:
                    tmp[x] = ''
        results.append(" ".join(tmp))
    return results

out_lines = remove_stop_words(in_lines)

Solution

  • in_lines = ['this is go:od', 
                'that example is bad', 
                'amp is a word']
    
    def remove_words(in_list, bad_list):
        out_list = []
        for line in in_list:
            # Keep a word only if none of the phrases in bad_list occurs in it.
            words = ' '.join([word for word in line.split()
                              if not any([phrase in word for phrase in bad_list])])
            out_list.append(words)
        return out_list
    
    out_lines = remove_words(in_lines, ['amp', ':'])
    print(out_lines)  # ['this is', 'that is bad', 'is a word']
    

    Strange as it sounds, the expression

    [word for word in line.split() if not any([phrase in word for phrase in bad_list])]
    

    does all the hard work at once. For each word, the inner list comprehension builds a temporary list with one True/False value per phrase in the "bad" list, recording whether that phrase occurs in the word. The any function condenses this temporary list into a single True/False value, and only if that value is False is the word copied into the output for the current line.
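To make the temporary True/False lists visible, here is a small sketch that prints them for one of the example lines. The names flags_per_word and kept are made up for illustration and are not part of the answer's function.

```python
bad_list = ['amp', ':']
line = 'that example is bad'

# For each word, one True/False flag per phrase in bad_list.
flags_per_word = [(word, [phrase in word for phrase in bad_list])
                  for word in line.split()]

for word, flags in flags_per_word:
    # any(flags) is True as soon as at least one phrase occurs in the word.
    print(word, flags, any(flags))

# Keep only the words for which no flag is True.
kept = [word for word, flags in flags_per_word if not any(flags)]
print(' '.join(kept))  # that is bad
```

Only "example" gets a True flag (it contains 'amp'), so it is the only word dropped from this line.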

    As an example, removing all words that contain the letter a gives:

    >>> remove_words(in_lines, ['a'])
    ['this is go:od', 'is', 'is word']
    

    (It is possible to fold the outer for line in ... loop into the comprehension as well. At that point, readability really starts to suffer, though.)
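For reference, a sketch of what that fully compressed version might look like. The name remove_words_compact is hypothetical, chosen here only to keep it apart from the function above, and it assumes the same whitespace splitting as the original.

```python
def remove_words_compact(in_list, bad_list):
    # Outer comprehension: one output string per input line.
    # Inner generator: keep the words that contain none of the bad phrases.
    return [
        ' '.join(word for word in line.split()
                 if not any(phrase in word for phrase in bad_list))
        for line in in_list
    ]

in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']

print(remove_words_compact(in_lines, ['amp', ':']))
# ['this is', 'that is bad', 'is a word']
```

This variant passes a generator to any instead of a list, so checking stops at the first matching phrase; the result is the same as with the explicit loop.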