I have seen similar questions to mine but I could not find a good solution to mine yet in tidyr
and dplyr
. I have a concatenated column of different text categories where I don't know a complete list of them. It is big data and I could not identify all categories in this column. I need to split them all for each ID
and create a corresponding binary column indicating if that category is lined with the ID
.
this is a sample of the data
df <- data.frame(id=c(1,2,3,4,5),
number=c("a,b,d","e,a","c","","k,t"))
df %>% glimpse()
Rows: 5
Columns: 2
$ id <dbl> 1, 2, 3, 4, 5
$ number <chr> "a,b,d", "e,a", "c", "", "k,t"
my desired data will be like
id a b C d e k t
1 1 1 0 1 0 0 0
2 1 0 1 0 1 0 0
3 0 0 1 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 1 1
Thank you in advance and I hope it is clear.
This approach uses separate_longer_delim
from tidyr to split your text categories and then borrows a convenient dummy encoding function from recipes
for one-hot encoding.
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(dplyr)
library(tidyr)
df <- data.frame(id=c(1,2,3,4,5),
number=c("a,b,d","e,a","c","","k,t"))
df %>%
separate_longer_delim(number, delim = ',') %>%
recipe(~.) %>%
step_dummy(number, one_hot = TRUE) %>%
prep() %>%
bake(new_data = NULL) %>%
group_by(id) %>%
summarize(across(everything(), max))
#> # A tibble: 5 × 9
#> id number_X number_a number_b number_c number_d number_e number_k number_t
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 1 1 0 1 0 0 0
#> 2 2 0 1 0 0 0 1 0 0
#> 3 3 0 0 0 1 0 0 0 0
#> 4 4 1 0 0 0 0 0 0 0
#> 5 5 0 0 0 0 0 0 1 1
Created on 2023-03-16 with reprex v2.0.2