R read.csv didn't load all rows of .tsv file


A little mystery. I have a .tsv file that contains 58936 rows. I loaded the file into R using this command:

dat <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t")

but nrow(dat) only shows this:

> nrow(dat)
[1] 28485

I then used sed -n to write the rows around the point where loading stopped (the rows before it, that row itself, and the rows after it) to a new file. That file loaded into R without problems, so I don't think the file is corrupted.

Is it an environment issue?

Here's my sessionInfo()

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sqldf_0.4-10   RSQLite_1.0.0  DBI_0.3.1      gsubfn_0.6-6   proto_0.3-10   scales_0.2.4   plotrix_3.5-11
[8] reshape2_1.4.1 dplyr_0.4.1   

loaded via a namespace (and not attached):
 [1] assertthat_0.1   chron_2.3-45     colorspace_1.2-4 lazyeval_0.1.10  magrittr_1.5     munsell_0.4.2   
 [7] parallel_3.1.2   plyr_1.8.1       Rcpp_0.11.4      rpart_4.1-8      stringr_0.6.2    tools_3.1.2 

Did I run out of memory? Is that why it didn't finish loading?


Solution

  • I had a similar problem lately, and it turned out I had two different problems.

    1 - Not all rows had the right number of tabs. I ended up counting them using awk.

    2 - At some points in the file I had quotes that were not closed. This caused the reader to skip over all the lines until it found a closing quote.

    I will dig up the awk code I used to investigate and fix these issues and post it.

    Since I am using Windows, I used the awk that came with git bash.

    This counted the tabs in each line and printed out the lines that did not have the right number (6 fields, i.e. 5 tabs, in my file), each prefixed with its tab count.

      awk -F "\t" 'NF!=6 { print NF-1 ":" $0 } ' Catalog01.csv  
    

    I used something similar to count quotes, and I used tr to fix a lot of it.
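
A quote count along the same lines might look like the sketch below. It is not the original command from this answer; it assumes the stray characters are double quotes (the default quote character for read.csv), and the sample file and its contents are made up for illustration.

```shell
# Build a small tab-separated sample; line 2 has an unclosed double quote.
# (The temporary file and its contents are hypothetical test data.)
sample=$(mktemp)
printf 'a\t"ok"\tc\n' > "$sample"
printf 'd\t"broken\tf\n' >> "$sample"
printf 'g\th\ti\n' >> "$sample"

# Split each line on double quotes, so NF-1 is the number of quotes in it.
# An odd count means a quote was opened but never closed on that line.
unbalanced=$(awk -F'"' '(NF - 1) % 2 == 1 { print NR ": " $0 }' "$sample")
printf '%s\n' "$unbalanced"

rm -f "$sample"
```

On a real file only the awk line is needed, with the file name in place of "$sample". A field that legitimately spans several lines would also be reported, so the output is a list of candidates to inspect rather than a list of definite errors. If the quotes in the data are not meant to delimit fields at all, passing quote="" to read.csv turns quote handling off.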