Tags: pandas, nlp, nltk

Removing Non-English Words from CSV - NLTK


I am relatively new to Python and NLTK. I have some Flickr data stored in a CSV file and want to remove non-English words from the tags column. I keep getting errors saying "expected a String or a byte-like object". I suspect this is because the tags column is currently a pandas Series rather than a string. However, none of the related solutions I've seen on Stack Overflow for converting it to a string have worked.

I have this code:

#converting pandas df to string
filtered_new = df_filtered_english_only.applymap(str)

#check it's converted to string
from pandas.api.types import is_string_dtype
is_string_dtype(filtered_new['tags'])

filtered_new['tags'].dropna(inplace=True)
tokens = filtered_new['tags'].apply(word_tokenize)

#print(tokens)

#remove non-English tags
#initialise corpus of English words from nltk
words = set(nltk.corpus.words.words())
" ".join(w for w in nltk.word_tokenize(df_filtered_english_only["tags"]) \
         if w.lower() in words or not w.isalpha())

Any ideas how to resolve this?


Solution

  • Generally: You should give an example of your dataset.

    What does the column "tags" contain originally? How are the tags separated? How is "no tags" expressed, and is there a difference between an empty list and NaN?

    I assume a tag can consist of multiple words, which matters here, especially when it comes to removing non-English words.

    But for simplicity's sake, let's assume there are only one-word tags separated by whitespace, so that each row's content is a string. Let's also assume that empty rows (no tags) hold the default pandas NA value (numpy.nan). And since you probably read the file with pandas, some values might have been auto-converted to numbers.

    Setup:

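    import numpy
    import pandas
    import nltk
    
    # One-time downloads, if the NLTK data is not installed yet:
    # nltk.download("punkt")  # tokenizer models; newer NLTK releases may ask for "punkt_tab"
    # nltk.download("words")  # English word list used for filtering
    
    # numpy.nan is the lowercase spelling; the numpy.NaN alias was removed in NumPy 2.0
    df = pandas.DataFrame({"tags": ["bird dog cat xxxyyy", numpy.nan, "Vogel Hund Katze xxxyyy", 123]})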
    >                       tags
      0      bird dog cat xxxyyy
      1                      NaN
      2  Vogel Hund Katze xxxyyy
      3                      123
    

    Drop NA rows and tokenize:

    df.dropna(inplace=True)
    tokens = df["tags"].astype(str).apply(nltk.word_tokenize)
    > 0        [bird, dog, cat, xxxyyy]
      2    [Vogel, Hund, Katze, xxxyyy]
      3                           [123]
      Name: tags, dtype: object
    

    Filter by known words, always allow non-alpha:

    words = set(nltk.corpus.words.words())
    filtered = [" ".join(w for w in row if w.lower() in words or not w.isalpha()) for row in tokens]
    > ['bird dog cat', '', '123']
    

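    The main problem in your code is probably that you iterate flatly over a nested structure: after tokenizing, each row of the pandas Series is itself a list of tokens. If you nest the iteration as in the example above, the code should run. Note also that your last statement passes the whole column df_filtered_english_only["tags"] to nltk.word_tokenize, which expects a single string; that is consistent with the error message you report.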
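
As a minimal sketch of the nested iteration described above, the following writes the filtered tags back into the DataFrame. To stay runnable without any NLTK data, it assumes a tiny hypothetical vocabulary (KNOWN_WORDS) in place of set(nltk.corpus.words.words()) and uses str.split() in place of nltk.word_tokenize. The helper keep_known and the column name tags_english are illustrative and not taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for set(nltk.corpus.words.words()), so the sketch
# runs without downloading the NLTK word list.
KNOWN_WORDS = {"bird", "dog", "cat"}


def keep_known(tokens, vocabulary):
    """Keep tokens that are in the vocabulary or are not purely alphabetic."""
    return [t for t in tokens if t.lower() in vocabulary or not t.isalpha()]


df = pd.DataFrame({"tags": ["bird dog cat xxxyyy", None, "Vogel Hund Katze xxxyyy", 123]})

# Drop missing rows first, then convert the remaining values to strings.
df = df.dropna(subset=["tags"])

# Whitespace splitting stands in for nltk.word_tokenize here (an assumption).
tokens = df["tags"].astype(str).str.split()

# Outer loop: apply() visits each row; inner loop: keep_known() visits each token.
df["tags_english"] = tokens.apply(lambda row: " ".join(keep_known(row, KNOWN_WORDS)))

print(df["tags_english"].tolist())
```

With the real corpus you would pass set(nltk.corpus.words.words()) as the vocabulary and swap str.split() for nltk.word_tokenize; the nesting stays the same.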

    Also, never convert to string (with .astype(str) or any other way) BEFORE removing NAs, because the NAs then become strings like 'nan' and are no longer removed. First drop the NAs to handle empty cells, then convert to handle other values such as numbers.
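
The ordering point can be checked with a small sketch that runs both orders on the same hypothetical Series. How a missing value is rendered by astype(str) depends on the pandas version, so the sketch prints the first result rather than assuming it; only the drop-first order is relied on.

```python
import pandas as pd

# Hypothetical sample: one string, one missing value, one number.
tags = pd.Series(["bird dog", None, 123], dtype=object)

# Convert first, then drop: if the missing value was turned into text,
# dropna() has nothing left to remove.
converted_first = tags.astype(str).dropna()

# Drop first, then convert: the missing row is gone before any conversion.
dropped_first = tags.dropna().astype(str)

print(converted_first.tolist())
print(dropped_first.tolist())
```

Dropping first gives the same result regardless of how the installed pandas version stringifies missing values, which is why that order is the safer default.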