I am trying to do calculations on a large data set (~80 million rows, 9 columns), but the problem is that the data set is uncleaned: it contains 9 unwanted header rows (with a different number and type of columns) that repeat before every 2280 rows of actual data.
I have tried different options, from basic read.table to sqldf, ff, and data.frame, but I cannot read only the actual data, and being new to R is an added worry. The only option that works is read.table(file, skip = 9, fill = TRUE) and subsetting the result afterwards, but that still reads the unwanted rows, takes ages, and runs me out of memory. I have spent hundreds of hours researching and reading PDFs, but nothing explains this in detail, or it is too difficult for a beginner like me.
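For reference, this is roughly what I have been doing (the file name and the exact subsetting step are only illustrative):

dat <- read.table("myFile.txt", skip = 9, fill = TRUE)  # pads short header rows with NA
dat <- dat[complete.cases(dat), ]                       # then try to drop the junk rows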
The file looks like this:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
2280
ITEM: BOX BOUNDS pp pp pp
-6.16961 6.16961
-6.16961 6.16961
-6.16961 6.16961
ITEM: ATOMS id mol type x y z ix iy iz
1 1 1 -0.31373 3.56934 -0.560608 1 -1 6
2 1 1 0.266159 3.08043 -1.20681 1 -1 6
3 1 1 1.07006 3.55954 -1.09484 1 -1 6
I want to read just the 9 data columns, skipping the 9 header rows that repeat before every 2280 rows of data, without running out of memory.
Specifications: Windows 8 x64, 4 GB RAM, 512 GB SSD, dual-core CPU, 64-bit R.
I'd recommend downloading Cygwin64 on Windows. You can do fast processing on large data sets there and send chunks to files that can then be read into R. Here's an example.
From the shell, remove the first 9 lines of the original data, "myFile.txt", and send the rest to "myFile2.txt":
$ tail -n +10 myFile.txt > myFile2.txt
Then, in R
> read.table('myFile2.txt')
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 1 1 1 -0.313730 3.56934 -0.560608 1 -1 6
# 2 2 1 1 0.266159 3.08043 -1.206810 1 -1 6
# 3 3 1 1 1.070060 3.55954 -1.094840 1 -1 6
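Since the 9-line header repeats before every 2280 data rows, the tail step above only strips the first block. Here is a minimal pure-R sketch of the same chunked idea (file names are placeholders, and it assumes every block is exactly 9 header lines followed by 2280 data lines):

con <- file("myFile.txt", open = "r")        # original file
out <- file("myFile2.txt", open = "w")       # cleaned file
repeat {
  header <- readLines(con, n = 9)            # discard the 9-line header
  if (length(header) == 0) break             # stop at end of file
  writeLines(readLines(con, n = 2280), out)  # keep the 2280 data lines
}
close(con); close(out)
dat <- read.table("myFile2.txt")             # 9 columns, V1..V9

Because only one block of ~2,300 lines is held in memory at a time, this should stay well within 4 GB of RAM.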