linux r awk system statistics-bootstrap

Big data read subsamples R


Thank you for taking the time to read this.

I have a huge 30GB CSV file with 6 million records and 3000 columns (mostly categorical data). I want to bootstrap subsamples for multinomial regression, but it's proving difficult: even with 64GB of RAM in my machine and a swap file twice that size, the process becomes extremely slow and halts.

I'm thinking about generating subsample indices in R and feeding them into a system command using sed or awk, but I don't know how to do this. If someone knows a clean way to do this using just R commands, I would be really grateful.

One problem is that each subsample must contain complete observations: I need all the rows of a particular multinomial observation, and the number of rows varies from observation to observation. I plan to use glmnet and then some fancy transforms to get an approximation to the multinomial case. One other point is that I don't know how to choose a sample size that fits within memory limits.
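To make the "complete observations" requirement concrete, here is a minimal base R sketch of resampling at the observation level rather than the row level. It assumes each row carries an observation identifier; the name `obs_id` and the toy values are hypothetical stand-ins, not taken from the actual file.

```r
# Toy stand-in for the ID column of the big file: one entry per data row.
# `obs_id` is a hypothetical name; observations here span 1 to 4 rows.
set.seed(1)
obs_id <- rep(1:5, times = c(2, 4, 1, 3, 2))

# Row numbers belonging to each observation, keyed by ID.
rows_by_obs <- split(seq_along(obs_id), obs_id)

# Bootstrap at the observation level: draw IDs with replacement ...
boot_ids <- sample(names(rows_by_obs), size = length(rows_by_obs),
                   replace = TRUE)

# ... then expand every drawn ID to all of its rows, so that no
# observation is ever split. An ID drawn twice contributes its rows twice.
boot_rows <- unlist(rows_by_obs[boot_ids], use.names = FALSE)
```

`boot_rows` holds data-row numbers. If they are handed to an external tool that counts physical lines in a file with a header line, each number would need to be shifted by one.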

Appreciate your thoughts greatly.

R.version
platform       x86_64-pc-linux-gnu          
arch           x86_64                       
os             linux-gnu                    
system         x86_64, linux-gnu            
status                                      
major          2                            
minor          15.1                         
year           2012                         
month          06                           
day            22                           
svn rev        59600                        
language       R                            
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows   

Yoda


Solution

  • I think it's an exceedingly terrible idea to use CSV as your data format for such file sizes - why not load it into SQLite (or an "actual" database) and extract your subsets with SQL queries (using DBI/RSQLite)?

    You only need to import once, and there is no need to load the entire thing into memory, because you can import CSV files directly into SQLite.

    If in general you want to work with datasets that are larger than your memory, you might also want to have a look at bigmemory.
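A rough sketch of what the SQLite route could look like from R, assuming the DBI and RSQLite packages are installed. The table name `obs`, the columns `obs_id`, `choice` and `x`, and the toy data are all hypothetical; a small in-memory table stands in for the imported CSV. The sampled IDs are written to a helper table and joined against the data, so every row of a drawn observation comes back, and an ID drawn twice returns its rows twice.

```r
library(DBI)  # assumes the DBI and RSQLite packages are installed

# Toy table standing in for the imported CSV (all names are hypothetical).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(
  obs_id = rep(1:5, times = c(2, 4, 1, 3, 2)),
  choice = c("a", "b", "a", "b", "c", "d", "a", "a", "b", "c", "a", "b"),
  x = seq_len(12),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "obs", toy)

# An index on the ID column avoids a full table scan for every subsample.
dbExecute(con, "CREATE INDEX idx_obs_id ON obs (obs_id)")

# Draw observation IDs with replacement (bootstrap at the observation level).
set.seed(1)
all_ids <- dbGetQuery(con, "SELECT DISTINCT obs_id FROM obs")$obs_id
boot_ids <- sample(all_ids, size = 3, replace = TRUE)

# Store the drawn IDs in a helper table and join: only the matching rows
# are pulled into R.
dbWriteTable(con, "boot_ids", data.frame(obs_id = boot_ids), overwrite = TRUE)
boot <- dbGetQuery(con, "
  SELECT o.*
  FROM boot_ids AS b
  JOIN obs AS o ON o.obs_id = b.obs_id
")

dbDisconnect(con)
```

For the real file, the one-time import could be done outside R with the `sqlite3` shell's `.mode csv` and `.import` commands. One caveat to check first: stock SQLite builds limit a table to 2000 columns by default (the compile-time option `SQLITE_MAX_COLUMN`), so a 3000-column file would need a custom build or a table split into several pieces.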