python pandas nlp text-mining

"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]" setting col A value if col B contains string


I have a dataframe (called corpus) with one column (tweet) and 2 rows:

['check, tihs, out, this, bear, love, jumping, on, this, plant']
['i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']

I have a list (called vocab) of unique words in the column:

['check',
 'tihs',
 'out',
 'this',
 'bear',
 'love',
 'jumping',
 'on',
 'plant',
 'i',
 'can',
 't',
 'the',
 'noise',
 'from',
 'that',
 'power',
 'it',
 'make',
 'me',
 'jump']

I want to add a new column for each word in vocab. I want all values for the new columns to be zero, except for when the tweet contains the word, in which case I want the value of the word column to be 1.

So I tried running the code below:

for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1

...and the following error was displayed:

"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]"

How can I check to see if the tweet contains the word, and then if so, set the value of the new column for the word to 1?


Solution

  • Each value in corpus['tweet'] is a list holding a single string, not a string itself, so .str.contains returns NaN for every row and .loc cannot use the result as a boolean mask. You may want to do:

    # turn tweets into strings
    corpus["tweet"] = [x[0] for x in corpus['tweet']]
    
    # one-hot-encode
    for word in vocab:
        corpus[word] = 0
        corpus.loc[corpus["tweet"].str.contains(word), word] = 1
    

    But this may not be what you want, because str.contains matches substrings: for example, the tweet "this girl goes to school" would get 1 in both the is and this columns.
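
To make the substring issue concrete, here is a minimal sketch using the example sentence above. The word-boundary pattern (`\b...\b`) is one possible workaround and is an assumption of this sketch, not part of the original answer; the variable names are illustrative only.

```python
import re
import pandas as pd

# Example sentence from the paragraph above, held in a one-element Series.
tweets = pd.Series(["this girl goes to school"])

# Plain substring match: "is" is found inside "this".
substring_hit = tweets.str.contains("is", regex=False)

# Whole-word match: "is" must be delimited by word boundaries.
# re.escape guards against vocabulary words containing regex metacharacters.
pattern = rf"\b{re.escape('is')}\b"
whole_word_hit = tweets.str.contains(pattern, regex=True)

print(substring_hit.tolist())   # [True]
print(whole_word_hit.tolist())  # [False]
```

If you keep the loop, swapping the word-boundary pattern into str.contains should avoid the false positive, at the cost of one regex scan per vocabulary word.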

    Based on your data, you can do:

    corpus["tweet"] = [x[0] for x in corpus['tweet']]
    
    corpus = corpus.join(corpus['tweet'].str.get_dummies(', ')
                             .reindex(vocab, axis=1, fill_value=0)
                        )
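
Putting the pieces together, the following is a self-contained sketch that rebuilds the two-row corpus from the question, shows that .str.contains gives NaN while the cells are still lists, and then applies the get_dummies approach. The DataFrame construction and the names mask_on_lists, dummies, and encoded are assumptions made for illustration; the question does not show how corpus was actually created.

```python
import pandas as pd

# Rebuild the data from the question: each cell is a one-element list.
corpus = pd.DataFrame({
    "tweet": [
        ["check, tihs, out, this, bear, love, jumping, on, this, plant"],
        ["i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump"],
    ]
})
vocab = ["check", "tihs", "out", "this", "bear", "love", "jumping", "on",
         "plant", "i", "can", "t", "the", "noise", "from", "that", "power",
         "it", "make", "me", "jump"]

# While the cells are lists, .str.contains cannot match and yields NaN,
# which is why .loc rejects the result as a mask.
mask_on_lists = corpus["tweet"].str.contains("bear")
all_nan = mask_on_lists.isna().all()

# Unwrap each one-element list into its string.
corpus["tweet"] = [x[0] for x in corpus["tweet"]]

# Split on ", " and one-hot encode whole tokens, then align to vocab order.
dummies = corpus["tweet"].str.get_dummies(sep=", ").reindex(columns=vocab, fill_value=0)
encoded = corpus.join(dummies)

print(all_nan)
print(encoded[["bear", "jumping", "jump", "noise"]])
```

Because get_dummies compares whole tokens, the first tweet gets a 1 for jumping but a 0 for jump, which the substring-based loop would not guarantee.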