I have a Series whose index is named "Article Number" and whose values are named "Text". The Series looks like this:
Article ID | Text
Article#1 This is a beautiful day
Article#2 I love you
Article#3 This is too late
Article#4 Love you back
Article#5 This is a lovely day
distinct_words = ['This', 'beautiful', 'day']
I'd like to create a dictionary whose keys are the distinct words and whose values are the lists of articles each word appears in. For the example above it would be:
vocabulary = {"This": ["Article#1", "Article#3", "Article#5"], "beautiful": ["Article#1"], "day": ["Article#1", "Article#5"]}
What I have written is:
vocabulary = {}
for word in distinct_words:
    filt = df.str.findall(word)
    vocabulary[word] = df.loc[filt].index
However, I get this error:
TypeError: unhashable type: 'list'
Can someone help me with this problem? I have tried a nested loop, but since my original file is large it takes minutes, and the result needs to be computed in under 40 seconds. I was told that using the re module would help.
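Series.str.findall() returns the matched values as a list for each row, so its result cannot be used as a filter in .loc (lists are unhashable, hence the TypeError). You can use Series.str.contains() instead, which returns a boolean mask telling you whether each value contains the word.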
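vocabulary = {}
for word in distinct_words:
    # If df is a Series rather than a DataFrame, use df[df.str.contains(word)] instead.
    vocabulary[word] = df[df['Text'].str.contains(word)].index.tolist()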
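For reference, here is a self-contained sketch that can be run as-is. It assumes the data is a plain Series built from the sample in the question (the variable names texts, wanted and inverted are made up for illustration). It shows the str.contains() loop from above and, since speed matters here, a possible single-pass alternative that tokenizes each text once with the re module instead of scanning the whole Series once per word. Whether the second approach is actually faster on the real file is something you would need to measure.

```python
import re
from collections import defaultdict

import pandas as pd

# Sample data copied from the question; the variable names are assumptions.
texts = pd.Series(
    [
        "This is a beautiful day",
        "I love you",
        "This is too late",
        "Love you back",
        "This is a lovely day",
    ],
    index=pd.Index(
        ["Article#1", "Article#2", "Article#3", "Article#4", "Article#5"],
        name="Article Number",
    ),
    name="Text",
)
distinct_words = ["This", "beautiful", "day"]

# Approach 1: one boolean mask per word.
# regex=False treats the word as a literal substring, so "day" would also
# match "days" or "today".
vocabulary = {
    word: texts[texts.str.contains(word, regex=False)].index.tolist()
    for word in distinct_words
}

# Approach 2: a single pass over the texts, matching whole words only.
wanted = set(distinct_words)
inverted = defaultdict(list)
for article, text in texts.items():
    # Intersecting with a set lists each article once per word,
    # even if the word repeats within the text.
    for word in wanted.intersection(re.findall(r"\w+", text)):
        inverted[word].append(article)
inverted = dict(inverted)

print(vocabulary)
print(inverted)
```

Both approaches are case-sensitive, so "Love" and "love" count as different words. Note that the two can disagree on other data: the first matches substrings, the second matches whole words.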