Search code examples
rdplyrsparse-matrixlarge-data

Dataframe is too big for supercomputer


I am trying to create a matrix of donors and recipients, populated with the sum of donations produced in each couple keeping the eventual NAs.

It works well for small datasets (See toy example below) but when I switch to national datasets (3m entries) several problems emerge: besides being painstakingly slow, the creation of the fill df consume all the memory of the (super)computer and I get the error "Error: cannot allocate vector of size 1529.0 Gb"

How should I tackle the problem? Thanks a lot!

library(dplyr)
library(tidyr)
libray(bigmemory)

candidate_id <- c("cand_1","cand_1","cand_1","cand_2","cand_3")
donor_id <- c("don_1","don_1","don_2","don_2","don_3")
donation <- c(1,2,3.5,4,10)
df = data.frame(candidate_id,donor_id,donation)
colnames(df) <- c("candidate_id","donor_id","donation")

fill <- df %>% 
  group_by(df$candidate_id,df$donor_id) %>% 
  summarise(tot_donation=sum(as.numeric(donation))) %>%
  complete(df$candidate_id,df$donor_id)

fill <- unique(fill[ ,1:3])
colnames(fill) <- c("candidate_id","donor_id","tot_donation")

nrow = length(unique(df$candidate_id))
ncol = length(unique(df$donor_id))
row_names = unique(fill$candidate_id)
col_names = unique(fill$donor_id)

x <- big.matrix(nrow, ncol, init=NA,dimnames=list(row_names,col_names))

for (i in 1:nrow){
  for (j in 1:ncol){

    x[i,j] <- fill[which(fill$candidate_id == row_names[i] & 
                       fill$donor_id == col_names[j]), 3]
  }
}

Solution

  • I see you're using unique because your output has duplicated values. Based on this question, you should try the following in order to avoid duplication:

    fill <- df %>% 
        group_by(candidate_id, donor_id) %>% 
        summarise(tot_donation=sum(donation)) %>%
        ungroup %>%
        complete(candidate_id, donor_id)
    

    Can you then try to create your desired output? I think unique can be very resource-heavy, so try to avoid calling it. The tidyr version of what Benjamin suggested should be:

    spread(fill, donor_id, tot_donation)
    

    EDIT: By the way, since you tagged the question with sparse-matrix, you could indeed use sparsity to your advantage:

    library(Matrix)
    library(dplyr)
    
    df <- data.frame(
      candidate_id = c("cand_1","cand_1","cand_1","cand_2","cand_3"),
      donor_id = c("don_1","don_1","don_2","don_2","don_3"),
      donation = c(1,2,3.5,4,10)
    )
    
    summ <- df %>% 
        group_by(candidate_id, donor_id) %>% 
        summarise(tot_donation=sum(donation)) %>%
        ungroup
    
    num_candidates <- nlevels(df$candidate_id)
    num_donors <- nlevels(df$donor_id)
    smat <- Matrix(0, num_candidates, num_donors, sparse = TRUE, dimnames = list(
      levels(df$candidate_id),
      levels(df$donor_id)
    ))
    
    indices <- summ %>%
      select(candidate_id, donor_id) %>%
      mutate_all(unclass) %>%
      as.matrix
    
    smat[indices] <- summ$tot_donation
    smat
    
    3 x 3 sparse Matrix of class "dgCMatrix"
           don_1 don_2 don_3
    cand_1     3   3.5     .
    cand_2     .   4.0     .
    cand_3     .   .      10