r  dataframe  multiple-columns  sparse-matrix  reshape

How to reshape and then convert a data frame into dgCMatrix?


I have a data frame like the following (the row names are "1", "2", "3", ...). Since every column contains non-unique entries, I cannot use any of them as row names.

gene cell count
a    c1    1
a    c2    1
a    c3    4
b    c1    3
b    c2    1
b    c3    1
f    c1    3
d    c8    9
e    c11   1

Every gene is measured in every cell, so each gene/cell pair has a count, but zero counts are not stored. For example, gene "a" has zero counts in cells c8 and c11, so those rows do not appear.
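
For reproducibility, the table above can be rebuilt as follows. This is a minimal sketch; the object name `df` is an assumption (it matches the name used in the answer), and `count` is stored as a plain numeric column.

```r
# Example data reconstructed from the table above; the name `df` is assumed.
df <- data.frame(
  gene  = c("a", "a", "a", "b", "b", "b", "f", "d", "e"),
  cell  = c("c1", "c2", "c3", "c1", "c2", "c3", "c1", "c8", "c11"),
  count = c(1, 1, 4, 3, 1, 1, 3, 9, 1),
  stringsAsFactors = FALSE
)

# Zero counts are implicit: only 9 of the 5 x 5 = 25 gene/cell pairs are listed.
n_listed   <- nrow(df)
n_possible <- length(unique(df$gene)) * length(unique(df$cell))
```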

Now I want to reshape the data frame and convert it into a dgCMatrix with the following arrangement (genes as row names, cells as column names, and counts as the values):

   c1  c2  c3  c8  c11 
a  1   1   4   .    .
b  3   1   1   .    .

where "." corresponds to a zero count.

I tried reshape, reshape2, and as.matrix, as suggested in many posts here, but without success.


Solution

  • First convert to wide format and set the gene column as row names:

    library(Matrix)
    library(dplyr)
    library(tidyr)

    # One row per gene, one column per cell; absent gene/cell pairs become 0
    mat <- df %>%
      pivot_wider(id_cols = gene, names_from = cell, values_from = count,
                  values_fill = list(count = 0)) %>%
      tibble::column_to_rownames("gene")
    

    Then convert the result to a sparse matrix:

    mat <- Matrix(as.matrix(mat), sparse = TRUE)
    mat

    5 x 5 sparse Matrix of class "dgCMatrix"
      c1 c2 c3 c8 c11
    a  1  1  4  .   .
    b  3  1  1  .   .
    f  3  .  .  .   .
    d  .  .  .  9   .
    e  .  .  .  .   1
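
    The approach above goes through a dense matrix before converting, which can be costly when there are many genes and cells. As an alternative sketch (not part of the original answer), the long-format rows can be passed directly to Matrix::sparseMatrix() as (row, column, value) triplets. The names `genes`, `cells`, and `mat2` are illustrative, and `df` is rebuilt here from the question's table so the block runs on its own.

```r
library(Matrix)

# Example data from the question; the name `df` is assumed.
df <- data.frame(
  gene  = c("a", "a", "a", "b", "b", "b", "f", "d", "e"),
  cell  = c("c1", "c2", "c3", "c1", "c2", "c3", "c1", "c8", "c11"),
  count = c(1, 1, 4, 3, 1, 1, 3, 9, 1),
  stringsAsFactors = FALSE
)

# Keep genes and cells in order of first appearance, matching the output above.
genes <- unique(df$gene)
cells <- unique(df$cell)

# Each data frame row becomes one (i, j, x) triplet; unlisted pairs stay implicit zeros.
mat2 <- sparseMatrix(
  i = match(df$gene, genes),
  j = match(df$cell, cells),
  x = df$count,
  dims = c(length(genes), length(cells)),
  dimnames = list(genes, cells)
)

class(mat2)
```

    Note that sparseMatrix() sums the values of repeated (i, j) pairs, so this sketch assumes each gene/cell combination appears at most once, or that summing duplicates is the desired behaviour.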