I'm trying to create multiple dummy variables, based on one column called 'Tags' within my df (14 rows, 2 columns, Score and Tags. My problem is that in each cell there can be any number of chr values (up to about 30 values).
When I ask for:
str(df$Tags)
R returns:
chr [1:14] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolera"| __truncated__ ...
And when I ask for:
df$Tags[1]
R returns:
[1] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolerantie\", \"met familie\", \"met vrienden\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", chinees, gastronomisch, glutenvrij, kindvriendelijk, romantisch, traditioneel, trendy, verjaardag, zakelijk"
It seems that the values within the first cell are not formatted the same (the values between comma's)
So what I wish for, is to create a dummy variable for each possible value that occurs within each cells. So the first new dummy should be called "biologische gerechten" (or any alike) and should show for each case whether the corresponding value is present (1) in the column 'Tags' or not (0).
i tried several things with 'dplyr' like:
df = mutate(df, biologisch = ifelse(Tags == "biologische gerechten", 1, 0))
R does create a new column 'biologisch', but it only contains zero's. Is there another way to separate all values and then create dummy variables for all possible values? Hope someone can help me, thank you!
Here's one solution:
# make some toy data to test
set.seed(1)
df <- data.frame(Score = rnorm(10),
Tags = replicate(10, paste(sample(LETTERS, 5), collapse = ", ")),
stringsAsFactors = FALSE)
# load stringr, which we'll use to trim whitespace from the split-up tags
library(stringr)
# use strsplit to break your jumbles of tags into separate elements, with a
# list for each position in the original vector. i've split on commas here,
# but you'll probably want to split on slashes or slashes and quotation marks.
t <- strsplit(df$Tags, split = ",")
# get a vector of the unique elements of those lists. you may need to use str_trim
# or something like it to cut leading and trailing whitespace. you might also
# need to use stringr's `str_subset` and a regular expression to cut the result
# down to, say, only alphanumeric strings. without a reproducible example, though,
# i can't do that for your specific case here.
tags <- unique(str_trim(unlist(t)))
# now, use `sapply` and `grepl` to look for each element of `tags` in each list;
# use `any` to summarize the results;
# use `+` to convert those summaries to binary;
# use `lapply` to iterate that process over all elements of `tags`;
# use `Reduce(cbind, ...)` to collapse the results into a table; and
# use `as.data.frame` to turn that table into a df.
df2 <- as.data.frame(Reduce(cbind, lapply(tags, function(i) sapply(t, function(j) +(any(grepl(i, j), na.rm = TRUE))))))
# assign the tags as column names
names(df2) <- tags
Voila:
> df2
Y F P C Z K A J U H M O L E S R T Q V B I X G
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0
6 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
7 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
8 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
9 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0
10 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 1