Search code examples
rsparse-matrixreshape

Convert a list with various vector length (Compressed Row Storage) into sparseMatrix in R


I use to store my sparse data in MariaDB table with DynamicColumns structure. When I am importing it to R I'm getting the following db_frame data frame:

db_frame <- dbGetQuery(mydb, "SELECT uid, column_json(groups) FROM matrix")
db_frame
  uid                      column_json(groups)
1   8              {"33755311":1,"58534882":1}
2   9                           {"75338985":1}
3  15               {"5445504":1,"58534882":1}
4  16 {"14897324":1,"22522055":1,"68471405":1}
5  20              {"22522055":1,"48940689":1}

My aim is to convert it from the above Compressed Row Storage to the following sparse-format(!) Matrix:

   14897324 22522055 33755311 48940689 5445504 58534882 68471405 75338985
8         .        .        1        .       .        1        .        .
9         .        .        .        .       .        .        .        1
15        .        .        .        .       1        1        .        .
16        1        1        .        .       .        .        1        .
20        .        1        .        1       .        .        .        .

I have done this in two ways but it seems to me both are very inefficient. In real life number of variables (columns) is about 2-3K, and number of observations (rows) ~ 1 million. So my both methods take a very long time to finish the work (days).

Is there other, more elegant way to do this conversion? Thank you in advance!

UPD: There are links to two dataframes in R format:

  1. db_frame - data frame from example 5 observations
  2. db_frame_10K - real data frame with 10K observations for speed testing

UPD2: SPEED COMPARISON TABLE for convertion data frame with 10K observation
Core i3 2.93 Ghz

Method1 My            162-188 sec
Method2 My             94-102 sec
Method3 @amatsuo_net    47-57 sec (but not exactly required format)
Method4 @amatsuo_net+My     6 sec 

For your information my two methods:

The first step is to convert db_frame data frame to more friendly list to have a possibility to extract variables names for each observation from it.

library("rjson")
var_list <- lapply(db_frame[,-1],fromJSON)
var_list_names <- lapply(var_list,names)
var_list_names
[[1]]
[1] "33755311" "58534882"

[[2]]
[1] "75338985"

[[3]]
[1] "5445504"  "58534882"

[[4]]
[1] "14897324" "22522055" "68471405"

[[5]]
[1] "22522055" "48940689"

Lists of variables and observations names:

groups <- sort(unique(unlist(var_list_names)))
groups
[1] "14897324" "22522055" "33755311" "48940689" "5445504"  "58534882" "68471405" "75338985"
uids <- db_frame$uid
uids
[1] "8"  "9"  "15" "16" "20"

------------- Method 1 using "for i" cycle

row_number = length(uids)
col_number = length(groups)

# creating empty sparse matrix M1
M1 <- sparseMatrix(dims = c(row_number,col_number), i={}, j={}, x=1)
rownames(M1) <- uids
colnames(M1) <- groups

# filling M1
for (i in 1:row_number) {
      M1[i,var_list_names[[i]]] <-1
}
M1

-------------- Method 2 using "reshape2"

library("reshape2")
long <- melt(var_list)
long
   value       L2 L1
1      1 33755311  1
2      1 58534882  1
3      1 75338985  2
4      1  5445504  3
5      1 58534882  3
6      1 14897324  4
7      1 22522055  4
8      1 68471405  4
9      1 22522055  5
10     1 48940689  5

i=long$L1
j=match(long[,"L2"],groups)

M2 <-sparseMatrix(i=i, j=j, x=1)
rownames(M2) <- uids
colnames(M2) <- groups
M2

Solution

  • Thanks to @amatsuo_net who point me to rbindlist function from data.table package.
    I have slightly simplified his code and added conversion to sparse format.
    Conversion time on test 10K observation is impressive 6 sec!

    -------------- Method 4 using "rbindlist"

    library(RMySQL)
    library(Matrix)
    library(rjson)
    library(data.table)
    library(magrittr)
    
    df <- dbGetQuery(mydb, "SELECT uid, column_json(groups) FROM matrix")    
    
    # "rbindlist" does all the work    
    M3 <- lapply(df[,-1],fromJSON) %>% rbindlist(fill=TRUE)
    
    # replace NA with 0 (required for sparsematrix type)
    M3[is.na(M3)] <- 0 
    # converting to sparsematrix type
    M3 <- as(as.matrix(M3), "sparseMatrix")
    
    # make some order :)
    M3 <- M3[, order(as.integer(colnames(M3)))]
    row.names(M3) <- df$uid