Tags: r, large-data, melt

Melting large dataframes--is there a practical size limit for reshape2's melt?


The problem

I am trying to reshape a survey dataset loaded into a dataframe with about 11k variables and 2k rows into a long(er) format, in order to do some analysis on variables that resulted from looped questions. I have not been able to figure out a way to get around memory allocation errors.

Am I hitting the practical size limit for using melt on dataframes (this one is about 28 MB as a CSV file)? Is there a different way to use melt, or would you use a different function/library for this purpose?

What I've tried so far

I've tried using reshape2's melt function, which should be straightforward but gives a memory error immediately ("cannot allocate vector of size...").

Then I tried breaking up the looped variables into chunks, in order to get many smaller dataframes to melt and then re-constitute. That gives me similar errors (with smaller sizes that cannot be allocated).

For reference, my data has an identifier field ("SbjNum"), about 1900 variables that occur only once, and 99 variables that occur 100 times each (with a prefix of "I_X_I_Y", where X and Y identify the loops). The looped variables should be melted into rows, one per unique combination of X and Y.
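To illustrate what "rows corresponding to unique X and Y" could mean in practice, here is a minimal base R sketch that pulls X and Y out of variable names. The example names and the "_Q1" suffix are assumptions made up for illustration; only the "I_X_I_Y" prefix comes from the description above.

```r
# Hypothetical column names; only the "I_X_I_Y" prefix is from the question,
# the "_Q1" suffix is an assumption for illustration.
vars <- c("I_1_I_1_Q1", "I_1_I_2_Q1", "I_10_I_3_Q1")

# Capture X, Y, and whatever follows the loop prefix.
pattern <- "^I_(\\d{1,2})_I_(\\d{1,2})_?(.*)$"

loop_index <- data.frame(
  variable = vars,
  X        = as.integer(sub(pattern, "\\1", vars)),
  Y        = as.integer(sub(pattern, "\\2", vars)),
  question = sub(pattern, "\\3", vars),
  stringsAsFactors = FALSE
)

print(loop_index)
```

A lookup table like this could be merged onto the molten data by its variable column, so that X and Y become ordinary columns.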

Just using melt naively looked like this:

molten <- melt(data, id.vars = c("SbjNum"))
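A rough back-of-the-envelope estimate suggests why this single call is heavy. The sketch below uses the approximate counts given above. The per-row byte costs are assumptions (numeric id, factor-coded variable, numeric value), and it ignores intermediate copies, so real memory use is likely higher.

```r
n_rows   <- 2000          # "about 2k rows"
n_once   <- 1900          # variables that occur once (approximate)
n_looped <- 99 * 100      # 99 variables repeated 100 times

# Melting with only SbjNum as id gives one row per (subject, variable) pair.
molten_rows <- n_rows * (n_once + n_looped)

# Assumed cost per molten row: 8 bytes (numeric id)
# + 4 bytes (factor code for variable) + 8 bytes (numeric value).
bytes_per_row <- 8 + 4 + 8
est_mb <- molten_rows * bytes_per_row / 1024^2

cat(format(molten_rows, big.mark = ","), "rows, roughly",
    round(est_mb), "MB before any intermediate copies\n")
```

Under these assumptions the result is about 23.6 million rows and roughly 450 MB. If the columns have mixed types, the value column may end up as character, which would cost more.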

The chunking I've tried so far looks like this:

library(reshape2)

# all variable names produced by the loops
loops <- grep("I_\\d{1,2}_I_\\d{1,2}", names(data), value = TRUE)

# number of desired chunks
nloopvars <- length(loops)
nchunks <- 100

# split the loop variable names into nchunks contiguous groups
chunk_id <- sort(seq_len(nloopvars) %% nchunks)
chunks <- split(loops, chunk_id)

# melt each small subset of the data
molten <- lapply(chunks, function(x) {
  # take only the identifier and one chunk of loop variables
  df <- data[c("SbjNum", x)]
  # melt the loop variables
  melt(df, id.vars = "SbjNum")
})
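The chunked call leaves molten as a list of long data frames, which still have to be stacked back together. A minimal base R sketch of that step follows. The toy list is a made-up stand-in for the real result and only assumes the default column names that melt produces (SbjNum, variable, value).

```r
# Toy stand-in for the list returned by lapply() above (values are invented).
molten <- list(
  `0` = data.frame(SbjNum = c(1, 2), variable = "I_1_I_1_Q1", value = c(5, 3)),
  `1` = data.frame(SbjNum = c(1, 2), variable = "I_1_I_2_Q1", value = c(4, 1))
)

# Stack all chunks into one long data frame.
long <- do.call(rbind, molten)
rownames(long) <- NULL   # drop the "0.1", "0.2", ... row names rbind creates

print(long)
```

For many chunks, data.table::rbindlist(molten) may be a faster way to do the same stacking, assuming the data.table package is available.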

EDIT: After terminating and restarting R, and clearing my workspace in several different ways, the chunking approach (the second one above) now works.


Solution

  • After terminating and restarting R, and clearing the workspace several times, my own "chunking" approach is now working (see question). I recommend trying this if you run into similar issues.

    [There is still a question of up to what size melting makes sense, but I can live without knowing that answer for now.]
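For anyone wanting to try the same thing, here is a minimal sketch of one way to clear leftovers from failed attempts without restarting. The object name leftover is hypothetical, and this is not necessarily what was done above. Restarting R is what actually resolved it here, and a restart may free memory that this does not.

```r
# A large temporary object standing in for leftovers from failed attempts.
leftover <- numeric(1e6)

rm(leftover)        # drop one specific object
# rm(list = ls())   # or clear the whole workspace (destructive, so commented out)
invisible(gc())     # ask R to collect garbage and release unused memory
```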