Search code examples
rdataframemultiple-columns

Check for common items in other rows


I have more of a general question. I have a dataframe like below, consisting of ids that could have checked multiple items (at least 3, at max 6).

id  item_1      item_2      item_3      item_4      item_5      item_6
1   13103802    13060661    13339404    12896842    13308823    NA
2   448361      497992      13103802*   13002842    NA          NA
3   13031560    13103802*   13268709    2139908     1954965     12930979    
4   13060661*   13339404*   446881      13406902    NA          NA
5   12980231    12980231    12980231    NA          NA          NA
6   12896842*   13339404*   12717215    444032      13308823*   NA  
7   2098716     449342      13339070    12993196    2649922     NA  
8   2678151     12700906    12903744    2623298     12736032    349511  
9   2501765     2534504     2629353     NA          NA          NA  
10  12955428    12766447    12944593    NA          NA          NA

Now for each id I would like to count how many other ids share similar 1 to 6 items. So in the end I would like to add 6 more columns with counts of rows that share 1 item, 2 items etc..

So, based on data above, for the first row, column '1 item' would have value of 4, because it shares at least one item with rows 2,3,4 and 6, column '2 items' would have value of 2 because it shares at least 2 items with rows 4 and 6 and column '3 items' would have value of 1 because it shares at least 3 items with row 6, and so on (I marked values shared with first row in other rows with * so it is more visible I hope).

Not sure about the approach, can anyone help?


Solution

  • A bit messy and convoluted with tidyverse, but something you can try out. If this is close to what you need let me know.

    library(tidyverse)
    
    df %>%
      pivot_longer(-id, names_pattern = "(\\d+)$") %>%
      filter(!is.na(value)) %>%
      mutate(n = n_distinct(id), ids = list(unique(id)), .by = value) %>%
      unnest(ids) %>%
      filter(id != ids) %>%
      reframe(freq = as.numeric(table(ids)), .by = id) %>%
      right_join(expand_grid(id = df$id, col = 1:(ncol(df)-1)), by = "id", multiple = "all") %>%
      replace_na(list(freq = 0)) %>%
      reframe(value = sum(freq >= col), .by = c("id", "col")) %>%
      pivot_wider(id_cols = id, names_from = col, values_from = value, names_prefix = "count") %>%
      arrange(id)
    

    Output

          id count1 count2 count3 count4 count5 count6
       <int>  <int>  <int>  <int>  <int>  <int>  <int>
     1     1      4      2      1      0      0      0
     2     2      2      0      0      0      0      0
     3     3      2      0      0      0      0      0
     4     4      2      1      0      0      0      0
     5     5      0      0      0      0      0      0
     6     6      2      1      1      0      0      0
     7     7      0      0      0      0      0      0
     8     8      0      0      0      0      0      0
     9     9      0      0      0      0      0      0
    10    10      0      0      0      0      0      0
    

    Data

    df <- structure(list(id = 1:10, item_1 = c(13103802L, 448361L, 13031560L, 
    13060661L, 12980231L, 12896842L, 2098716L, 2678151L, 2501765L, 
    12955428L), item_2 = c(13060661L, 497992L, 13103802L, 13339404L, 
    12980231L, 13339404L, 449342L, 12700906L, 2534504L, 12766447L
    ), item_3 = c(13339404L, 13103802L, 13268709L, 446881L, 12980231L, 
    12717215L, 13339070L, 12903744L, 2629353L, 12944593L), item_4 = c(12896842L, 
    13002842L, 2139908L, 13406902L, NA, 444032L, 12993196L, 2623298L, 
    NA, NA), item_5 = c(13308823L, NA, 1954965L, NA, NA, 13308823L, 
    2649922L, 12736032L, NA, NA), item_6 = c(NA, NA, 12930979L, NA, 
    NA, NA, NA, 349511L, NA, NA)), class = "data.frame", row.names = c(NA, 
    -10L))