Search code examples
rdplyrconcatenationtidyr

spliting a concatenated text column into binary columns in R (dplyr) when a list of all categories is not known


I have seen similar questions to mine but I could not find a good solution to mine yet in tidyr and dplyr . I have a concatenated column of different text categories where I don't know a complete list of them. It is big data and I could not identify all categories in this column. I need to split them all for each ID and create a corresponding binary column indicating if that category is lined with the ID . this is a sample of the data

df <- data.frame(id=c(1,2,3,4,5),
                   number=c("a,b,d","e,a","c","","k,t"))
df %>% glimpse() 
Rows: 5
Columns: 2
$ id     <dbl> 1, 2, 3, 4, 5
$ number <chr> "a,b,d", "e,a", "c", "", "k,t"

my desired data will be like

id a b C d e k t
1  1 1 0 1 0 0 0
2  1 0 1 0 1 0 0
3  0 0 1 0 0 0 0 
4  0 0 0 0 0 0 0 
5  0 0 0 0 0 1 1

Thank you in advance and I hope it is clear.


Solution

  • This approach uses separate_longer_delim from tidyr to split your text categories and then borrows a convenient dummy encoding function from recipes for one-hot encoding.

    library(recipes)
    #> Loading required package: dplyr
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    #> 
    #> Attaching package: 'recipes'
    #> The following object is masked from 'package:stats':
    #> 
    #>     step
    library(dplyr)
    library(tidyr)
    
    df <- data.frame(id=c(1,2,3,4,5),
                     number=c("a,b,d","e,a","c","","k,t"))
    
    df %>%
      separate_longer_delim(number, delim = ',') %>%
      recipe(~.) %>%
      step_dummy(number, one_hot = TRUE) %>%
      prep() %>%
      bake(new_data = NULL) %>%
      group_by(id) %>%
      summarize(across(everything(), max))
    #> # A tibble: 5 × 9
    #>      id number_X number_a number_b number_c number_d number_e number_k number_t
    #>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
    #> 1     1        0        1        1        0        1        0        0        0
    #> 2     2        0        1        0        0        0        1        0        0
    #> 3     3        0        0        0        1        0        0        0        0
    #> 4     4        1        0        0        0        0        0        0        0
    #> 5     5        0        0        0        0        0        0        1        1
    

    Created on 2023-03-16 with reprex v2.0.2