Search code examples
rdataframedplyrsummary

R summaries/count combinations in a data frame and display as new data frame/count matrix for column


I have a large data set and need to compare all combinations between columns. The desired output would be a matrix for each column combination.

The data frame to start with could look like Data:

set.seed(1)
Data <- data.frame(
  ID = (1:100),
  A = sample(1:10,10),
  B = sample(1:20,100,replace = T),
  C = sample(1:5,100,replace = T),
  D = sample(1:20,100,replace = T)
  )     
Data

and I want to know how often appears the same combination within two columns. (e.g. how often is a 1 in A in the same row with a 4 in B) for all combinations between column A to D?

I was using:

require(dplyr)

X1 <- ddply(Data,.(A,B),transmute, count=length(ID))

and get a object like:

     A  B count
1    1  3     1
2    1  7     1
3    1  9     2
4    1  9     2
5    1 12     1
6    1 13     1
7    1 14     1
8    1 16     1
9    1 18     1
10   1 20     1
11   2  2     1
12   2  6     1
13   2 10     1
14   2 11     1

But how can I get the count result in a matrix format?

The output for A vs. B cold look like:

    B1  B2  B3  B4  B5  B6
A1  1   1   2   1   1   ...
A2  1   1   2   1   1   
A3  2   1   1   1   1   
A4  2   1   1   1   1   
A5  1   1   2   1   2   
A6  1   1   2   1   2   
A7  1   3   1   1   1   
A8  1   3   1   1   2   
A9  1   3   2   1   2   
A10 1   1   2   1   1


In the best case the result would be a `list`  containing the objects  `AB`  `AC` ...`CD` as matrix. 

Solution

  • You can do this :

    library(tidyverse)
    X2 <-X1 %>% group_by(A,B) %>% 
      summarise(count=max(count)) %>% #use max instead of sum
      ungroup() %>%
      mutate(A=paste0("A",A),B=paste0("B",B)) %>% 
      spread(B,count,fill=0)
    
    X3 <- as.matrix(X2[,2:ncol(X2)])
    rownames(X3) <- as.character(X2$A)