Search code examples
rlistdataframeintegrate

integrating 2 lists and a data farme in RR


Please could you advise on the R code I could use in order to do the following operation :

  1. I have 2 lists of "genome coordinates" : a list is composed by numbers that represent genome coordinates;

let's say list N:

n1    
n2    
n3    
n4

and a list M:

m1     
m2     
m3     
m4     
m5
  1. and a dataframe C, where for some pairs of coordinates (n,m) from the lists above, we have a numerical intensity;

for example:

n1; m1; 100 
n1; m2; 300

The question would be : what is the most efficient R code I could use in order to integrate the list N, the list M, and the dataframe C, in order to obtain a dataframe with:

  • list N as the columns names
  • list M as the rows names
  • the values in the cells of N * M, corresponding to the numerical values in the data frame C.

A little example would be :

     n1  n2  n3 n4 
m1  100  -   -   - 
m2  300  -   -   - 
m3   -   -   -   - 
m4   -   -   -   - 
m5   -   -   -   -

Solution

  • You can do this using spread in the tidyr package, making sure to keep all values of n and m that appear in the two lists, even if they don't appear in C:

    library(tidyr)
    ## Replicating the data
    listN = list("n1","n2","n3","n4","n5")
    listM = list("m1","m2","m3","m4","m5")
    C = data.frame(n=c("n1","n2","n3"),m=c("m1","m2","m3"),I=c(100,300,400))
       n  m   I
    1 n1 m1 100
    2 n2 m2 300
    3 n3 m3 400
    
    ## Defining factor levels of C using listM and listN, and keeping unused levels when doing the spread
    C$n = factor(C$n,levels=unlist(listN))
    C$m = factor(C$m,levels=unlist(listM))
    res = spread(C,key="n",value="I",drop=F)
    

    This returns:

       m  n1  n2  n3 n4 n5
    1 m1 100  NA  NA NA NA
    2 m2  NA 300  NA NA NA
    3 m3  NA  NA 400 NA NA
    4 m4  NA  NA  NA NA NA
    5 m5  NA  NA  NA NA NA