
Reading large uncleaned data efficiently into R


I am trying to do calculations on a large data set (~80 million rows, 9 columns), but the problem is that the data set is uncleaned: it contains 9 unwanted rows (with a different number and type of columns) that repeat themselves before every 2280 rows of actual data.

I have tried different options, from the basic (read.table) to sqldf, ff, and data.frame, but I am unable to read only the actual data, and being new to R is an added worry. The only option that works is read.table(file, skip = 9, fill = T) and subsetting it thereafter, but that also reads the unwanted rows, takes ages, and runs out of memory. I have tried and researched for hundreds of hours, reading PDFs, but nothing explains this in detail, or it is too difficult for a beginner like me.
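For illustration, the working-but-slow attempt looks roughly like this (the file name "dump.txt" is a placeholder, and the exact subsetting rule is only a guess at how the header lines end up being parsed):

    ## reads all ~80 million data rows *plus* the repeated header rows
    dat <- read.table("dump.txt", skip = 9, fill = TRUE,
                      stringsAsFactors = FALSE)
    ## afterwards, keep only rows whose 9th field is a number
    ## (the header rows are shorter, or non-numeric, in that position)
    dat <- dat[!is.na(suppressWarnings(as.numeric(dat$V9))), ]
    dat[] <- lapply(dat, as.numeric)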

It looks like:

ITEM: TIMESTEP  
0  
ITEM: NUMBER OF ATOMS  
2280  
ITEM: BOX BOUNDS pp pp pp  
-6.16961 6.16961  
-6.16961 6.16961  
-6.16961 6.16961  
ITEM: ATOMS id mol type x y z ix iy iz   
1 1 1 -0.31373 3.56934 -0.560608 1 -1 6   
2 1 1 0.266159 3.08043 -1.20681 1 -1 6   
3 1 1 1.07006 3.55954 -1.09484 1 -1 6   

I want to read the 9 column values, skipping the 9 header rows that precede each block of 2280 data rows, without running out of memory.

Specifications: Windows 8 x64, 4 GB RAM, 512 GB SSD, dual-core CPU, 64-bit R


Solution

  • I'd recommend downloading Cygwin64 on Windows. You can do fast processing on large data sets and send chunks to files, which can then be processed in R. Here's an example.

    From the shell, remove the first 9 lines and send the rest to "myFile2.txt", where "myFile.txt" is the original data (the header blocks that repeat later in the file are covered in the sketch after the R output below):

    $ tail -n +10 myFile.txt > myFile2.txt 
    

    Then, in R

    > read.table('myFile2.txt')
    #   V1 V2 V3        V4      V5        V6 V7 V8 V9
    # 1  1  1  1 -0.313730 3.56934 -0.560608  1 -1  6
    # 2  2  1  1  0.266159 3.08043 -1.206810  1 -1  6
    # 3  3  1  1  1.070060 3.55954 -1.094840  1 -1  6
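
    One caveat: tail -n +10 only strips the very first header block, and the nine header lines repeat before every block of 2280 atoms, so the later headers are still in "myFile2.txt" and need the same treatment. Below is a minimal pure-R sketch of the same pre-processing idea, assuming the 9-header/2280-row block structure shown in the question (the output name "myClean.txt" is just a placeholder):

    ## Stream the dump in blocks of 9 header lines + 2280 atom rows and
    ## write only the atom rows to a cleaned file, so nothing large is
    ## ever held in memory at once.
    infile  <- file("myFile.txt",  open = "r")
    outfile <- file("myClean.txt", open = "w")
    repeat {
      header <- readLines(infile, n = 9)                  # repeated header block
      if (length(header) == 0) break                      # end of file
      writeLines(readLines(infile, n = 2280), outfile)    # atom rows only
    }
    close(infile)
    close(outfile)

    ## If the cleaned file fits in memory, it can then be read directly:
    # atoms <- read.table("myClean.txt",
    #                     col.names = c("id", "mol", "type",
    #                                   "x", "y", "z", "ix", "iy", "iz"))

    Keep in mind that 80 million rows of 9 numeric columns comes to roughly 5-6 GB once loaded, so even the cleaned file may not fit in 4 GB of RAM; in that case, aggregate or do the calculations block by block inside the loop instead of reading everything back in at once.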