Search code examples
rdplyrcorrespondence

Correspondence between character columns


I have a dataframe with five character columns. Each column has a limited number of values (categorical data). In the dataset, each value in one column occurs a variable number of times with a the other values in the other columns.

Here is an example data set:

d<- structure(list(ID = c(17, 12, 12, 17, 17, 12, 12, 17, 31, 13), 
    card = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3), curf = c("c11", "c11", 
    "c11", "c11", "c12", "c12", "c12", "c12", "c08", "c08"), 
    mas = c("m2_indo", "m2_indo", "m2_indo", "m2_indo", "m2_indo", 
    "m2_indo", "m2_indo", "m2_indo", "m3_every", "m3_every"), 
    vac = c("v_100", "v_100", "v_100", "v_100", "v_200", "v_200", 
    "v_200", "v_200", "v_100", "v_100"), scho = c("s_nope", "s_nope", 
    "s_nope", "s_nope", "s_50", "s_50", "s_50", "s_50", "s_nope", 
    "s_nope"), alco = c("a3_nsol", "a3_nsol", "a3_nsol", "a3_nsol", 
    "a2_thu", "a2_thu", "a2_thu", "a2_thu", "a1_sat", "a1_sat"
    )), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

      ID  card curf  mas      vac   scho   alco   
   <dbl> <dbl> <chr> <chr>    <chr> <chr>  <chr>  
 1    17     1 c11   m2_indo  v_100 s_nope a3_nsol
 2    12     1 c11   m2_indo  v_100 s_nope a3_nsol
 3    12     1 c11   m2_indo  v_100 s_nope a3_nsol
 4    17     1 c11   m2_indo  v_100 s_nope a3_nsol
 5    17     2 c12   m2_indo  v_200 s_50   a2_thu 
 6    12     2 c12   m2_indo  v_200 s_50   a2_thu 
 7    12     2 c12   m2_indo  v_200 s_50   a2_thu 
 8    17     2 c12   m2_indo  v_200 s_50   a2_thu 
 9    31     3 c08   m3_every v_100 s_nope a1_sat 
10    13     3 c08   m3_every v_100 s_nope a1_sat 

I want to compute the number of times each possible value of column occurs at the same time than values taken in other columns.

The target is a table such as:

col1  col2     No_of_Occurence
c11   m2_indo   xxx
c12   m2_indo   xxx
c08   m2_indo   xxx 
c11   v_100     xxx
c12   v_100     xxx
c08   v_100     xxx
...
s_50   a2_thu   xxx

I do not see any sound strategy to compute this?


Solution

  • Here's a way to do it for all character columns in one go, without needing to know the names of the columns in advance.

    long1 <- d %>% 
      mutate(Row=row_number()) %>% 
      pivot_longer(cols=where(is.character), names_to="Col1", values_to="Value1")
    long2 <- d %>% 
      mutate(Row=row_number()) %>% 
      pivot_longer(cols=where(is.character), names_to="Col2", values_to="Value2")
    
    long1 %>% 
      left_join(long2, by="Row") %>% 
      filter(Col1 != Col2) %>% group_by(Value1, Value2) %>% 
      summarise(N=n(), .groups="drop")
    # A tibble: 58 x 3
       Value1  Value2       N
     * <chr>   <chr>    <int>
     1 a1_sat  c08          2
     2 a1_sat  m3_every     2
     3 a1_sat  s_nope       2
     4 a1_sat  v_100        2
     5 a2_thu  c12          4
     6 a2_thu  m2_indo      4
     7 a2_thu  s_50         4
     8 a2_thu  v_200        4
     9 a3_nsol c11          4
    10 a3_nsol m2_indo      4
    # … with 48 more rows