Tags: r, memory-management, ff

Correlation matrix using large data sets in R when ff matrix memory allocation is not enough


I have a simple analysis to do: I just need to calculate the correlation of the columns (or of the rows, if transposed). Simple enough? Yet I have been unable to get results for a whole week, and I have looked through most of the solutions here.

My laptop has 4GB of RAM. I do have access to a server with 32 nodes. My data cannot be posted here as it is huge (411k columns and 100 rows; I work with it transposed, so the 411k variables become rows). If you need any other information, or perhaps part of the data, I can try to put it up here, but the problem can be explained without actually seeing it: I need a correlation matrix of size 411k x 411k, which means computing the correlation among the 411k rows of the transposed data.

Approaches I have tried to code (all of them either run into memory issues or run forever):

  1. The simplest way: correlate one row against all the others and write each result out with append = T. (Runs forever.)
  2. bigcorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), which splits the data into blocks and uses an ff matrix. (I am unable to allocate memory for the corMAT matrix using ff() on my server.)
  3. Splitting the data into sets (every 10,000 consecutive rows form a set) and correlating each set against every other set (same logic as bigcorPar; a sketch of this block-wise idea appears after this list). But I cannot find a way to store all the pieces together to assemble the final 411k x 411k matrix.
  4. What I am attempting now: bigcorPar.r on 10,000 rows against all 411k (with the 10,000 divided into blocks), saving the results in separate csv files.
  5. I am also running every 1,000 rows vs all 411k on one node of my server; today is my third day and I am still on row 71.
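
Roughly, the block-wise idea from points 3 and 4 looks like this (a simplified sketch, not bobthecat's code; `dat`, the block size, and the output file names are placeholders), with the 411k variables as rows of the transposed data:

    block.size <- 10000
    n.vars <- nrow(dat)                      # the 411k variables, as rows
    starts <- seq(1, n.vars, by = block.size)

    for (i in starts) {
      rows.i <- i:min(i + block.size - 1, n.vars)
      for (j in starts[starts >= i]) {       # cor is symmetric: upper triangle only
        rows.j <- j:min(j + block.size - 1, n.vars)
        ## cor() on the transposed sub-matrices yields one tile of the full matrix
        block <- cor(t(dat[rows.i, ]), t(dat[rows.j, ]))
        saveRDS(block, sprintf("corblock_%d_%d.rds", i, j))
      }
    }

Each saved block is one tile of the 411k x 411k result, and computing only the upper triangle halves the work; my problem is assembling and storing all the tiles.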

I am not an R pro, so this is as far as I could get. Either my code runs forever or I do not have enough memory to store the results. Are there more efficient ways to tackle this?

Thanks for all your comments and help.


Solution

  • I'm familiar with this problem myself in the context of genetic research.

    If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN; more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/).
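
    For illustration, a minimal sketch of a Matrix eQTL call for this task (not my production code; the file name, slice size, and p-value threshold are placeholders). Passing the same matrix as both inputs yields all pairwise tests among its rows, and only pairs passing the threshold are written out:

        library(MatrixEQTL)

        data = SlicedData$new()
        data$fileDelimiter = "\t"
        data$fileSkipRows = 1          # header line
        data$fileSkipColumns = 1       # row labels
        data$fileSliceSize = 2000      # read 2000 rows at a time
        data$LoadFile("data.txt")

        me = Matrix_eQTL_engine(
            snps = data,
            gene = data,
            cvrt = SlicedData$new(),   # no covariates
            output_file_name = "significant_correlations.txt",
            pvOutputThreshold = 1e-8,  # keep only very significant pairs
            useModel = modelLINEAR,
            errorCovariance = numeric(),
            verbose = TRUE,
            pvalue.hist = FALSE)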

    If you want to keep all correlations, I'd like to first warn you that even in binary format (economical compared to text) they would take 411,000 x 411,000 x 8 bytes ≈ 1.35 TB. If this is what you want and you are OK with the storage required, I can provide my code for such calculations and storage.
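
    If you do go down the store-everything road, one generic option (a sketch, separate from my code) is a file-backed matrix from the bigmemory package, which keeps the result on disk rather than in RAM; the file names below are placeholders:

        library(bigmemory)

        n <- 411000
        ## Creates a ~1.35 TB backing file on disk; indexing works like a
        ## regular matrix, but only the touched parts are pulled into RAM.
        corMAT <- filebacked.big.matrix(
            nrow = n, ncol = n, type = "double",
            backingfile = "corMAT.bin",
            descriptorfile = "corMAT.desc")

        ## Fill it tile by tile, e.g. with blocks computed block-wise:
        ## corMAT[rows.i, rows.j] <- block
        ## corMAT[rows.j, rows.i] <- t(block)   # symmetry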