Search code examples
rdplyrdummy-variable

How to generate a set of dummy variables dependent on values in several other columns with same prefix in R?


I have a data frame I am trying to put in a format I can use for my analysis that looks the following:

ID Name Year K1 K2 ... K50
1 Contract XYZ 2000 transport elephants
2 Agreement ABC 2003 pens music
3 Document 123 2003 elephants
4 Empty Space 2004 music transport

Basically, I have documents with a unique ID and the year they were signed plus a set of variables name K_1 to K_50 that contain unordered keywords. For each of the keywords I would like to generate a dummy variable (i.e. named transport, pens...) that is one if any of the K_1 to K_50 entries contains that particular string.

I have tried the code below which creates the dummy with the name transport if K1 or K2 contain the code but for 50 keywords in 50 columns it is a lot of manual work. Ideally, I would be able to use the string as the variable name and run through all 50 columns to identify all keywords and create dummies that will be one if the keyword is present in either column for a particular ID. I would, however, also be happy with manually creating dummies and being able to look at all 50 keywords without typing it all in.

document_dummies <- mutate(document,
                  transport = case_when(
                    K_1 == "transport" | K_2 == "transport" ~ 1,
                    TRUE ~ NA_real_ 
                  ))          

Solution

  • Does this help you?

    library(tidyverse)
    
    document <- data.frame(
      stringsAsFactors = FALSE,
      ID = c(1L, 2L, 3L, 4L),
      Name = c("Contract XYZ","Agreement ABC",
               "Document 123","Empty Space"),
      Year = c(2000L, 2003L, 2003L, 2004L),
      K1 = c("transport", "pens", "elephants", "music"),
      K2 = c("elephants", "music", NA, NA),
      K50 = c(NA, NA, NA, "transport")
    )
    document %>%
      pivot_longer(starts_with("K")) %>%
      select(-name) %>%
      filter(! is.na(value)) %>%
      mutate(has_property = 1) %>%
      pivot_wider(names_from = value, values_from = has_property)
    #> # A tibble: 4 x 7
    #>      ID Name           Year transport elephants  pens music
    #>   <int> <chr>         <int>     <dbl>     <dbl> <dbl> <dbl>
    #> 1     1 Contract XYZ   2000         1         1    NA    NA
    #> 2     2 Agreement ABC  2003        NA        NA     1     1
    #> 3     3 Document 123   2003        NA         1    NA    NA
    #> 4     4 Empty Space    2004         1        NA    NA     1
    

    Created on 2021-09-21 by the reprex package (v2.0.1)