I have a dataframe (called corpus) with one column (tweet) and 2 rows:
['check, tihs, out, this, bear, love, jumping, on, this, plant']
['i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']
I have a list (called vocab) of unique words in the column:
['check',
'tihs',
'out',
'this',
'bear',
'love',
'jumping',
'on',
'plant',
'i',
'can',
't',
'the',
'noise',
'from',
'that',
'power',
'it',
'make',
'me',
'jump']
I want to add a new column for each word in vocab. I want all values for the new columns to be zero, except for when the tweet contains the word, in which case I want the value of the word column to be 1.
So I tried running the code below:
for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1
...and the following error was displayed:
"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]"
How can I check to see if the tweet contains the word, and then if so, set the value of the new column for the word to 1?
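Each value in corpus['tweet'] is a list holding a single string (a singleton), not a string. So .str.contains returns NaN for every row, and using that all-NaN mask in .loc raises the error you see. You may want to do: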
# turn tweets into strings
corpus["tweet"] = [x[0] for x in corpus['tweet']]
# one-hot-encode
for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1
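A quick way to confirm the diagnosis is to compare the mask before and after unwrapping the lists. This is a minimal sketch on a cut-down two-row frame; the shortened sample tweets and the names mask_before and mask_after are assumptions made for illustration, not part of the original data.

```python
import pandas as pd

# Cut-down sample: each cell is a one-element list, as in the question
corpus = pd.DataFrame({"tweet": [["check, this, out"], ["i, can, t, bear"]]})

# On list cells, .str.contains has no string to search, so it yields NaN
mask_before = corpus["tweet"].str.contains("bear")

# Unwrap each one-element list into its string, then search again
corpus["tweet"] = [x[0] for x in corpus["tweet"]]
mask_after = corpus["tweet"].str.contains("bear")

print(mask_before.isna().all())  # are all entries missing before unwrapping?
print(mask_after.tolist())       # per-row matches after unwrapping
```

Once the cells are plain strings, the mask should be boolean and safe to use inside .loc.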
But this may not be what you want, because contains matches substrings rather than whole words. For example, the tweet "this girl goes to school" gets a 1 in both the is and this columns, since "is" occurs inside "this".
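If you would rather keep the loop, one possible workaround (not part of the suggestion above, so treat it as an assumption) is to wrap each word in regex word boundaries. The sketch below uses a hypothetical one-row frame and two-word vocabulary just to show the difference; re.escape is there in case a vocabulary word contains regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical example data, chosen to expose the substring problem
corpus = pd.DataFrame({"tweet": ["this girl goes to school"]})
vocab = ["is", "this"]

# Plain substring search: "is" is found inside "this"
substring_hit = corpus["tweet"].str.contains("is", regex=False).tolist()

for word in vocab:
    # \b matches only at word boundaries, so "is" no longer matches "this"
    pattern = rf"\b{re.escape(word)}\b"
    corpus[word] = corpus["tweet"].str.contains(pattern, regex=True).astype(int)

print(substring_hit)
print(corpus[vocab])
```

With the boundaries in place, only the this column should be set for that tweet.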
Based on your data, you can do:
corpus["tweet"] = [x[0] for x in corpus['tweet']]
corpus = corpus.join(
    corpus['tweet'].str.get_dummies(', ')
                   .reindex(vocab, axis=1, fill_value=0)
)
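For reference, here is the same approach as a self-contained sketch that rebuilds the two-row frame and vocab from the question. The intermediate names dummies and result are assumptions added for readability. Because get_dummies splits each tweet on ', ' and compares whole tokens, "jump" is not matched by "jumping".

```python
import pandas as pd

# Rebuild the question's data: each cell is a one-element list
corpus = pd.DataFrame({"tweet": [
    ["check, tihs, out, this, bear, love, jumping, on, this, plant"],
    ["i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump"],
]})
vocab = ["check", "tihs", "out", "this", "bear", "love", "jumping", "on",
         "plant", "i", "can", "t", "the", "noise", "from", "that", "power",
         "it", "make", "me", "jump"]

# Unwrap the one-element lists into plain strings
corpus["tweet"] = [x[0] for x in corpus["tweet"]]

# Split on ', ' and build one 0/1 column per distinct token,
# then order the columns like vocab (absent words become all-zero columns)
dummies = corpus["tweet"].str.get_dummies(sep=", ")
dummies = dummies.reindex(columns=vocab, fill_value=0)

result = corpus.join(dummies)
print(result[["bear", "jump", "jumping"]])
```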