R, how to create a binary relation matrix from a list of strings?


I have a list of files that contain specific genes, and I want to create a binary relation matrix in R that shows the presence of each gene in each file.

For example, here are my files aaa, bbb, ccc, and ddd and the genes associated with them.

aaa=c("HERC1")
bbb=c("MYO9A", "PKHD1L1", "PQLC2", "SLC7A2")
ccc=c("HERC1")
ddd=c("MACC1","PKHD1L1")

I would like to know which command I could use in R to generate a binary relation table like the one in the following image:

[image: example binary relation table]

where the value 1 means association, and the value 0 means non-association.

How can I do this operation in R?

I tried to use table(aaa,bbb,ccc,ddd) but it did not work. R said:

Error in table(aaa, bbb, ccc, ddd) : all arguments must have the same length

EDIT: Thanks @akrun for your useful reply! I'll take advantage of this question to ask for help with another issue, which I'm sure you can handle very quickly. For the second part of my analysis, I need to generate another table where, for each pair of genes, I assign the value 1 if both of them are present in the specific file, and 0 otherwise. Following the example that I gave earlier, this new table should look like the following one (I transposed it for clarity):

[image: example gene-pair table]

Does anybody know a quick way to obtain this new bigenic table in R, starting from the commands you guys already provided to me? Thanks!


Solution

  • One option is to fetch the objects by name into a named list (mget), stack that list into a two-column data.frame, and get the frequencies with table:

    table(stack( mget(strrep(letters[1:4], 3)))[2:1])
    #   values
    #ind   HERC1 MACC1 MYO9A PKHD1L1 PQLC2 SLC7A2
    #  aaa     1     0     0       0     0      0
    #  bbb     0     0     1       1     1      1
    #  ccc     1     0     0       0     0      0
    #  ddd     0     1     0       1     0      0
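
The one-liner above can be unpacked into separate steps. The following is only a sketch of the same idea: the names gene_list, long, counts and binary are introduced here for illustration, and the list is built by hand rather than with mget. The final step is an assumption about what is wanted if a gene were ever listed twice for the same file, since table returns counts rather than strict 0/1 values.

```r
aaa <- c("HERC1")
bbb <- c("MYO9A", "PKHD1L1", "PQLC2", "SLC7A2")
ccc <- c("HERC1")
ddd <- c("MACC1", "PKHD1L1")

# Step 1: collect the vectors into a named list (the same shape mget() returns)
gene_list <- list(aaa = aaa, bbb = bbb, ccc = ccc, ddd = ddd)

# Step 2: stack() flattens the list into two columns:
#   values = gene name, ind = name of the list element (the file)
long <- stack(gene_list)

# Step 3: cross-tabulate file (rows) against gene (columns)
counts <- table(long$ind, long$values)

# Step 4 (optional): force strict 0/1 values in case a gene is repeated in a file
binary <- 1L * (unclass(counts) > 0)
```

Here binary is a plain integer matrix with files as row names and genes as column names, so a single cell can be looked up with, for example, binary["ddd", "PKHD1L1"].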
    

    Or an option with tidyverse

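    library(tidyverse)
    lst(aaa, bbb, ccc, ddd) %>%     # named list; names are taken from the objects
      enframe %>%                   # columns: name (file), value (list of genes)
      unnest(value) %>%             # one row per file/gene combination
      count(name, value) %>%        # frequency of each file/gene combination
      spread(value, n, fill = 0)    # wide format, 0 where a gene is absent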
    # A tibble: 4 x 7
    #  name  HERC1 MACC1 MYO9A PKHD1L1 PQLC2 SLC7A2
    #  <chr> <dbl> <dbl> <dbl>   <dbl> <dbl>  <dbl>
    #1 aaa       1     0     0       0     0      0
    #2 bbb       0     0     1       1     1      1
    #3 ccc       1     0     0       0     0      0
    #4 ddd       0     1     0       1     0      0
    

    In the OP's code

    table(aaa,bbb,ccc,ddd)
    

    the vectors all need to have the same length for table to work. In addition, if we pass more than two vectors, the frequency table becomes multi-dimensional (more than 2D). So we need a way to apply table to two columns (file and gene) instead of to several separate objects.
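
For the follow-up in the question's EDIT (a table of gene pairs), one possible approach is sketched below in base R. It is not part of the original answer, and the layout is an assumption because the example image is not available: files are rows, each unordered gene pair is a column, and a cell is 1 only when both genes occur in that file. The names gene_list, presence, pairs and pair_matrix, and the "_" separator in the column labels, are made up for illustration.

```r
gene_list <- list(
  aaa = c("HERC1"),
  bbb = c("MYO9A", "PKHD1L1", "PQLC2", "SLC7A2"),
  ccc = c("HERC1"),
  ddd = c("MACC1", "PKHD1L1")
)

# File-by-gene presence matrix (TRUE where the gene occurs in the file)
long <- stack(gene_list)
presence <- unclass(table(long$ind, long$values)) > 0

# All unordered gene pairs; each column of `pairs` holds one pair
genes <- colnames(presence)
pairs <- combn(genes, 2)

# For each pair, a file gets 1 only when both genes are present in it
pair_matrix <- apply(pairs, 2, function(p) {
  as.integer(presence[, p[1]] & presence[, p[2]])
})
dimnames(pair_matrix) <- list(
  rownames(presence),
  paste(pairs[1, ], pairs[2, ], sep = "_")
)
```

With the example data, pair_matrix["ddd", "MACC1_PKHD1L1"] is 1, while every pair is 0 for aaa and ccc because each of those files contains a single gene. Use t(pair_matrix) if the transposed layout is preferred.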