Search code examples
rtime-serieszoonaread.table

Read irregular format text file mixed of text with/without space and numbers


Sample data look like this (fyi I have hundreds of files like this). The tricky part is the "NO RECORD" in the file. I haven't been trying hours to get it into R without any success

BEGIN DATA
RIM
DATE           AF         QD         QU 
09/30/1920      NO RECORD       370.00  NO RECORD   
10/01/1920      NO RECORD       391.00     391.00 
10/02/1920      NO RECORD       496.00    MISSING 
10/03/1920      NO RECORD       660.00    MISSING 
10/04/1920      NO RECORD       881.00    MISSING 
10/05/1920      NO RECORD       660.00    MISSING 
10/06/1920      NO RECORD       515.00    -9999 
10/07/1920      NO RECORD       443.00    NO RECORD 
10/08/1920      NO RECORD       443.00    MISSING 
10/09/1920      NO RECORD       443.00    443.00 
10/10/1920      NO RECORD       443.00    MISSING

This is my latest R code

library(zoo)

# function to read data
obsRead <- function(path2file, filename, number_line_skip, header_or_not) {
  tmpName <- paste(path2file, filename, sep="")
  tmpData <- read.zoo(tmpName,
                   tz='', stringsAsFactors = FALSE, strip.white = TRUE,
                   header=header_or_not, skip=number_line_skip, 
                   na.strings = c("NA", "N/A", "MISSING", "NO RECORD", "-9999"), # tell zoo what NA values look like
  qName <- c("AF", "QD", "QU")
  names(tmpData) <- qName
  index(tmpData) <- as.Date(index(tmpData)) # Convert index from POSIXct to Date
  str(tmpData)
  return(tmpData)  
}

dataDir = "path/to/file/"
dataFile <- "sampleData.txt"
nLineSkip <- 3
header_or_not <- FALSE

Q_obs <- obsRead(dataDir, dataFile, nLineSkip, header_or_not)

And the error I got from R

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 6 elements

Any suggestion would be appreciated! Thanks!

Edit @MichaelChirico posted another solution using data.table package here

 fread(text=x, sep = " ", header=TRUE, fill=length(unlist(strsplit(x[1], " "))), na.strings=strrep("*", 1:6))

Solution

  • Provided NO RECORD is consistent, this should give you a start:

    tmp <- readLines("sample.dat")
    
    # substitute 'NO RECORD' for 'NORECORD' and use read.table()
    # to process the collapsed vector
    
    tmp.collapse <- paste(gsub("NO RECORD", "NORECORD", 
                          tmp[4:length(tmp)]), sep="", collapse="\n")
    
    # get the column names from the third row and use them in the data table
    
    read.table(textConnection(tmp.collapse), 
               header=FALSE, stringsAsFactors=FALSE, 
               col.names=unlist(strsplit(tmp[3], "\ +")))
    
    ##          DATE       AF  QD       QU
    ## 1  09/30/1920 NORECORD 370 NORECORD
    ## 2  10/01/1920 NORECORD 391   391.00
    ## 3  10/02/1920 NORECORD 496  MISSING
    ## 4  10/03/1920 NORECORD 660  MISSING
    ## 5  10/04/1920 NORECORD 881  MISSING
    ## 6  10/05/1920 NORECORD 660  MISSING
    ## 7  10/06/1920 NORECORD 515    -9999
    ## 8  10/07/1920 NORECORD 443 NORECORD
    ## 9  10/08/1920 NORECORD 443  MISSING
    ## 10 10/09/1920 NORECORD 443   443.00
    ## 11 10/10/1920 NORECORD 443  MISSING