I have a data frame like this:
id words
1: 1 capuccin,mok
2: 2 bimboll,ext,sajonjoli
3: 3 burrit,sincr
4: 4 div,tir,mini,doradit
5: 5 pan,multigran,linaz
6: 6 tost,integral
7: 7 pan,blanc
8: 8 sup,pan,bco,ajonjoli
9: 9 wond
10: 10 wond
I'm using the following code:
bag_of_words <- CountVectorizer$new()
result_df <- cbind(df$id, bag_of_words$fit_transform(df$words))
I'd like to get something like this:
tab_1$id capuccin mok bimboll ext sajonjoli...
1 1 1 1 0 0 0...
2 2 0 0 1 1 1...
3 3 0 0 0 0 0...
4 ... ... ... ... ... ...
But instead of a matrix with the number of occurrences of every word, it returns a single column, for the word wond:
df$id wond
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 1
10 10 1
What's wrong with my code?
I got it working with a method similar to the one suggested by tmfmnk in the comments:
tab_1 <- tab_1 %>%
  unnest(words) %>%
  mutate(words = strsplit(words, ','), occ = 1) %>%
  dcast(id ~ unlist(words), fill = 0)
Now it works as expected:
id ajonjoli bco bimboll ...
1 0 0 0 ...
2 0 0 1 ...
3 0 0 0 ...
... ... ... ...
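For anyone who wants the same result without extra packages, here is a minimal base R sketch of the idea: split each string on the commas, then cross-tabulate ids against words. This is not the code I ran above, just an illustration. The three-row `df` is copied from the first rows of the question, and the names `tokens`, `long` and `counts` are made up for the example.

```r
# Sample data: first three rows of the data frame from the question
df <- data.frame(
  id = 1:3,
  words = c("capuccin,mok", "bimboll,ext,sajonjoli", "burrit,sincr"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector of tokens
tokens <- strsplit(df$words, ",", fixed = TRUE)

# Long format: one row per (id, word) pair.
# Using a factor with explicit levels keeps every id in the table,
# in the original order, even if an id ends up with no words.
long <- data.frame(
  id = factor(rep(df$id, lengths(tokens)), levels = df$id),
  word = unlist(tokens)
)

# Cross-tabulate ids against words; each cell is the occurrence count
counts <- as.data.frame.matrix(table(long$id, long$word))

# Put the id column back in front of the word columns
result_df <- cbind(id = df$id, counts)
print(result_df)
```

The word columns come out in alphabetical order, as in the `dcast` output above, because `table()` sorts the factor levels of `word`.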