Tags: r, dataframe, text, data-ingestion

How can I read a table in a loosely structured text file into a data frame in R?


Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.

How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?

All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.

temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
              destfile = temp, mode = "wb")

processFile = function(filepath) {
  dat_list <- list()
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    append(dat_list, line)
  }

  close(con)

  return(dat_list)

}

dat_list <- processFile(temp)

Solution

  • Here's a possible alternative:

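    processFile <- function(filepath, header = TRUE, ...) {
      # Read the whole file as a character vector, one element per line
      lines <- readLines(filepath)
      # Flag the comment lines (those starting with "#")
      is_comment <- grepl("^#", lines)
      # The last comment line holds the column names; strip its leading "#"
      header_row <- sub("^#", "", lines[tail(which(is_comment), 1)])
      # Parse the header row followed by all non-comment lines
      data <- read.table(text = c(header_row, lines[!is_comment]), header = header, ...)

      return(data)
    }

    processFile(temp)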
    

    The idea is to read in all the lines, find the ones that start with "#", and ignore all of them except the last one, which is used as the header. We remove the "#" from that header line (otherwise read.table would treat it as a comment, since its default comment.char is "#") and then pass the header plus the remaining lines to read.table to parse the data.
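To see those steps without downloading anything, here is a minimal sketch that runs them on a small made-up file. The sample lines, their values, and the names `sample_lines`, `tmp`, and `dat` are illustrative assumptions, not the real NOAA data; the only thing assumed about the real file is that its last "#" line carries the column names.

```r
# Made-up stand-in for the NOAA file (values are illustrative only)
sample_lines <- c(
  "# Some descriptive header text",
  "# More header text",
  "# year month day cycle trend",
  "2010 1 1 388.50 387.90",
  "2010 1 2 388.60 387.91"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

lines <- readLines(tmp)
is_comment <- grepl("^#", lines)

# The last comment line carries the column names; drop its leading "#"
header_row <- sub("^#", "", lines[max(which(is_comment))])

# Header first, then every non-comment line
dat <- read.table(text = c(header_row, lines[!is_comment]), header = TRUE)

names(dat)  # column names taken from the last "#" line
nrow(dat)   # number of data rows
```

As for the empty list in the question: `append()` returns a new list rather than modifying its argument in place, so the loop would need `dat_list <- append(dat_list, line)` to keep what it reads.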