Search code examples
rdplyrtidyrreshape2

Create a binary wide table from a long table (like tidyr::spread() )


As input for a tree model I created an analysis table in SQL. Now I want to transfer it to R because the model which has this table as input is also running in R. One of the SQL-steps I'm not able to transform into R.

The analysis table has the following form:

df <- data.frame(
  pseudonym = c("a", "a", "a", "b", "c", "c"),
  var1 = c(1,1,0,1,1,0),
  var2 = c(1,0,0,0,0,1),
  var3 = c(0,0,0,0,0,1))

> df
  pseudonym var1 var2 var3
1         a    1    1    0
2         a    1    0    0
3         a    0    0    0
4         b    1    0    0
5         c    1    0    0
6         c    0    1    1

In the next step I need the disctinct rows for pseudonym with keeping the information (1) from the other columns var1, var2, var3. (In SQL this is created through max(case when...then 1 else 0 end) as var1 )

Thus the result df2 created from df1 should be

df2 <- data.frame(
  pseudonym = c("a", "b", "c"),
  var1 = c(1,1,1),
  var2 = c(1,0,1),
  var3 = c(0,0,1))

> df2
  pseudonym var1 var2 var3
1         a    1    1    0
2         b    1    0    0
3         c    1    1    1

It would be very helpful if somebody has an idea.


Solution

  • Here's one way:

    library(dplyr)
    library(tidyr)
    
    df <- data.frame(
      pseudonym = c("a", "a", "a", "b", "c", "c"),
      var1 = c(1,1,0,1,1,0),
      var2 = c(1,0,0,0,0,1),
      var3 = c(0,0,0,0,0,1))
    
    df %>% 
      pivot_longer(cols = var1:var3) %>% 
      group_by(pseudonym, name) %>% 
      filter(max(value) == value) %>% 
      ungroup() %>% 
      distinct() %>% 
      pivot_wider(names_from = name, values_from = value)
    
    #># A tibble: 3 x 4
    #>  pseudonym  var1  var2  var3
    #>  <fct>     <dbl> <dbl> <dbl>
    #>1 a             1     1     0
    #>2 b             1     0     0
    #>3 c             1     1     1