Tags: python, nlp, data-cleaning

Python Text Cleaning


I am working on cleansing some text data and have a function that removes any non-English or gibberish words. It does a good job, but a few product names are not recognized as real words, so they get eliminated too. I am trying to come up with a way to keep certain words in the text.

Here is the code I have so far:

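    import nltk

    def clean_non_eng(text):
        # Vocabulary of known English words from the NLTK words corpus
        words = set(nltk.corpus.words.words())
        # Keep tokens that are English words or are not purely alphabetic (punctuation, numbers)
        text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                        if w.lower() in words or not w.isalpha())
        return text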

What I am thinking of is having some kind of list containing the words to keep, and incorporating it into my function so they are not eliminated:

    words_to_keep = ('wordtokeep1', 'wordtokeep2', 'wordtokeep3')

Is there a way I can incorporate another 'or' condition, such as "or not in words_to_keep"? I have tried a few different approaches but have not been successful so far.

As of now, if I call the function, it looks something like this:

    clean_non_eng('hello, this is a test of wordtokeep')

It returns: 'hello, this is a test of'


Solution

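  • The extra condition should not be 'or w not in words_to_keep' but rather 'or w in words_to_keep'.
    With 'not in', every token that is missing from the keep list passes the filter, so nothing gets removed.
    Using 'in' should solve your issue: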

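    def clean_non_eng(text):
        words = set(nltk.corpus.words.words())
        # A token survives if it is an English word, is not purely alphabetic,
        # or appears in the keep list (words_to_keep is defined outside the function)
        text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                        if w.lower() in words or not w.isalpha() or w in words_to_keep)
        return text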
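
If it helps, here is a minimal sketch of one possible variant that matches the keep list case-insensitively and takes both word sets as arguments instead of rebuilding the vocabulary on every call. Several things here are assumptions for illustration and not part of the original code: the `vocabulary` parameter, the tiny stand-in word set, and the product name "Zyntrix" are all hypothetical. The regular expression stands in for nltk.wordpunct_tokenize so the example runs without downloading the NLTK corpus; in real use you would likely pass set(nltk.corpus.words.words()) as the vocabulary.

```python
import re

# Stand-in for nltk.wordpunct_tokenize: runs of word characters
# or runs of punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def clean_non_eng(text, vocabulary, words_to_keep=()):
    """Drop alphabetic tokens that are neither in vocabulary nor in words_to_keep."""
    # Lowercase the keep list once so matching ignores case
    keep = {w.lower() for w in words_to_keep}
    tokens = TOKEN_PATTERN.findall(text)
    return " ".join(
        t for t in tokens
        if t.lower() in vocabulary or t.lower() in keep or not t.isalpha()
    )


# Hypothetical stand-in vocabulary; in practice this could be
# set(nltk.corpus.words.words()), built once and reused.
vocabulary = {"hello", "this", "is", "a", "test", "of"}

result = clean_non_eng(
    "hello, this is a test of Zyntrix xqzv",
    vocabulary,
    words_to_keep=["zyntrix"],  # hypothetical product name
)
print(result)
```

Note that joining the tokens with spaces puts a space before punctuation, so the comma comes back as its own token. Also, a keep word that contains a digit (like 'wordtokeep1') is already kept by the 'not w.isalpha()' check, since isalpha() is False for it.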