
R aborts while importing with read.csv()


I have a matrix of approximately 240 million rows and 3 columns that I need to import and work with in R. I do not have access to a server right now, so I got the idea of importing a submatrix, working with it, discarding it from the environment, and repeating the procedure until the whole matrix is done (for what I have to do, this works just as well). In particular, as the number of rows is a multiple of 11, I decided to work with 11 submatrices. What I have been doing is the following:

  1. Define Nstep as the number of rows to import each time (total number of rows / 11, about 22 million)
  2. mat.n <- read.csv2(filepath, nrows=Nstep, skip =(n-1)*Nstep-1, header=T)
  3. do what I have to do
  4. discard the matrix
  5. repeat (manually, to make sure every step is successful) the above for n = 1, 2, ..., 11.
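The steps above can be sketched as follows. This is only an illustration under stated assumptions, not a diagnosis of the abort: it assumes the blocks can be processed in file order, and it swaps the skip argument for a single open connection, because skip makes read.csv2() read through every skipped line again on each call. The demo file, the tiny Nstep and the per-block sum are hypothetical placeholders for the real data and the real work.

```r
# Small semicolon-separated demo file (hypothetical stand-in for the real data).
demo_path <- tempfile(fileext = ".csv")
demo <- data.frame(a = 1:22, b = 101:122, c = 201:222)
write.csv2(demo, demo_path, row.names = FALSE)

Nstep    <- 2    # rows per block; about 22 million in the question
n_blocks <- 11

# Open the file once; each read continues where the previous one stopped,
# so no skip argument is needed.
con <- file(demo_path, open = "r")

# Consume the header line once and keep the column names.
header_line <- readLines(con, n = 1)
col_names   <- scan(text = header_line, what = "", sep = ";", quiet = TRUE)

block_sums <- numeric(n_blocks)
for (n in seq_len(n_blocks)) {
  block <- read.csv2(con, nrows = Nstep, header = FALSE, col.names = col_names)
  block_sums[n] <- sum(block$a)  # placeholder for "do what I have to do"
  rm(block)                      # discard the block before reading the next one
}
close(con)
```

With a connection, a block cannot be re-read on its own without starting over from the top of the file, so this fits best when every block is processed exactly once, in order.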

After finishing importing the 6th block, I realized I had been leaving header=T, so I set header=F. Since then, every time I tried importing the file, the R session aborted. EDIT: setting header=T back does not work either.

I thought it depended on the header=F thing, but that was not the case. Therefore I guess it has to do with Nstep, or with the first row of the submatrix. I tried some experiments:

  • if I re-import the first block, it works
  • if I import only the first ten rows of the 5th block, it takes ages (I started it about 20 minutes ago and it has not finished yet, even though it's just 10 rows)
  • if I repeat it in R instead of R Studio, I have the same issues.

Any idea about why this is happening? I am working with R version 3.1.1 on R Studio Version 0.98.1028, Platform: x86_64-w64-mingw32/x64 (64-bit).


Solution

  • There are better alternatives to the read.* functions for big data files: the data.table package's fread() function, or the readr package, whose functions are slightly safer alternatives to fread() (a bit slower than fread(), but still very fast compared to the original read.* functions).

    At the end of the day you will still be limited by the size of your computer's memory. There are workarounds for that too, but I think for your case fread() or readr will do just fine.
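As a rough sketch of what the fread() route could look like for one block: the demo file, Nstep and n below are hypothetical stand-ins for the real data and block size, and the data.table package is assumed to be installed. The question uses read.csv2(), so a semicolon separator is assumed. If the real file uses decimal commas, fread()'s dec argument would presumably need to be set as well.

```r
library(data.table)

# Small semicolon-separated demo file (hypothetical stand-in for the real data).
demo_path <- tempfile(fileext = ".csv")
fwrite(data.table(a = 1:22, b = 101:122, c = 201:222), demo_path, sep = ";")

Nstep <- 2  # rows per block; about 22 million in the question
n     <- 5  # which block to read

# Read only the column names (nrows = 0 reads no data rows).
col_names <- names(fread(demo_path, sep = ";", nrows = 0))

# Skip the header line plus the rows of all earlier blocks,
# then read exactly one block without treating its first row as a header.
block <- fread(demo_path, sep = ";", header = FALSE,
               skip = 1 + (n - 1) * Nstep,
               nrows = Nstep, col.names = col_names)
```

Setting header = FALSE explicitly and passing the names through col.names keeps the first row of each block from being interpreted as a header, whichever block is read.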