
Read.table to skip lines with errors


I have a large .csv file, separated by tabs, which has a strict structure with colClasses = c("integer", "integer", "numeric"). For some reason there are a number of irrelevant character lines that break the pattern, which is why I get

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'an integer', got 'ExecutiveProducers'

How can I ask read.table to continue and just skip these lines? The file is large, so it's troublesome to do this by hand. If that's impossible, should I use scan plus a for loop?

For now I just read everything as character, then delete the irrelevant rows and convert the columns back to numeric, which I don't think is very memory-efficient.
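
In code, that workaround looks roughly like this (a sketch; it assumes the junk lines still have three tab-separated fields, so fill = TRUE may not even be needed, and "yourfile" is a placeholder path):

    # read everything as character, drop rows whose first column does not
    # parse as an integer (adjust the test to however the junk shows up),
    # then convert the columns to their proper types
    d <- read.table("yourfile", sep = "\t", colClasses = "character",
                    fill = TRUE, stringsAsFactors = FALSE)
    keep <- !is.na(suppressWarnings(as.integer(d[[1]])))
    d <- d[keep, ]
    d[[1]] <- as.integer(d[[1]])
    d[[2]] <- as.integer(d[[2]])
    d[[3]] <- as.numeric(d[[3]])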


Solution

  • If your file fits into memory, you could first read the whole file, remove the unwanted lines and then read the remaining lines with read.csv:

    lines <- readLines("yourfile")
    
    # remove unwanted lines: select only lines that do not contain 
    # characters; assuming you have column titles in the first line,
    # you want to add those back again; hence the c(1, sel)
    sel <- grep("[[:alpha:]]", lines, invert=TRUE)
    lines <- lines[c(1,sel)]
    
    # read data from selected lines
    con <- textConnection(lines)
    data <- read.csv(file = con)   # plus your other arguments as normal (sep, colClasses, ...)
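
    For the tab-separated file from the question, that last call would look something like this ("yourfile" above stands in for the real path; the separator and column classes are the ones given in the question):

    # read the filtered lines with the separator and column types from the question
    con <- textConnection(lines)
    data <- read.csv(file = con, sep = "\t",
                     colClasses = c("integer", "integer", "numeric"))
    close(con)   # free the connection when done
    str(data)    # two integer columns and one numeric column, junk lines gone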