r split data.table grepl skip

Need to skip different number of rows in R


I am using the following code for processing my data, but lately I realized that using skip = 27 (to skip the information stored in my files before the data starts) is not a good option, because the number of rows to skip is different in each file. My goal is to read various txt files that are stored in multiple folders and to fix the name of the temperature column. Not all files have the same number of columns, and the sequence of columns varies between files. My data appears as follows:

/* DATA DESCRIPTION:
Algorithm
Checks
    Version 
Parameter(s)
  Date/Time
  Pres
  Wind
  ...
  ...
*/
Date/Time Pres Wind Temp
2022-03-01S01:00:00 278 23 29
2022-03-01S02:00:00 278 23 23
..

I want to read my data starting from the line after */. To do this, I tried the code given here, but I am not able to rewrite it to fit my requirements. Could anyone please help me modify the code accordingly?


Solution

  • From your example, it looks like the first line you want to read starts with Date/Time.

    From the ?fread documentation, skip can be:

    ... skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).

    Using that, I would think you can do

    dt <- lapply(filelist, fread, skip = "Date/Time")
    

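    Since that doesn't work in this case (presumably because "Date/Time" also appears inside the comment block, so the search stops at that earlier line), here's an adaptation where we look for the last comment line and set the skip parameter accordingly, as in the answer you link in your question: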

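    library(data.table)

    dt <- lapply(filelist, function(file) {
      lines <- readLines(file)
      # index of the first line that is exactly "*/" (end of the comment block)
      comment_end <- match("*/", lines)
      # skip everything up to and including that line
      fread(file, skip = comment_end)
    })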
    

    If your files are very long and you can set an upper bound on the length of the comment, you could make this much more efficient by setting a maximum number of lines to read in readLines, e.g., lines <- readLines(file, n = 100) to read at most 100 lines when looking for the comment. If you want to be really fancy, you could check the first 100 lines and, if you still don't find the end of the comment, try again reading the whole file.
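
    The two-pass idea described above could be sketched roughly as follows. The helper names find_comment_end and read_after_comment and the n_head argument are hypothetical (they are not part of data.table), and the sketch assumes the closing line is exactly "*/":

    ```r
    # Hypothetical helper: index of the first line that is exactly "*/",
    # which is also the number of lines fread() should skip.
    find_comment_end <- function(file, n_head = 100L) {
      # First pass: only read the top of the file.
      idx <- match("*/", readLines(file, n = n_head, warn = FALSE))
      if (is.na(idx)) {
        # Fallback: the comment is longer than n_head lines, so scan the whole file.
        idx <- match("*/", readLines(file, warn = FALSE))
      }
      if (is.na(idx)) stop("no '*/' line found in ", file)
      idx
    }

    # Hypothetical wrapper; calling it requires the data.table package.
    read_after_comment <- function(file, n_head = 100L) {
      data.table::fread(file, skip = find_comment_end(file, n_head))
    }
    ```

    With these in place, dt <- lapply(filelist, read_after_comment) would replace the lapply call above. Stopping with an error when no "*/" line exists is a design choice made here so that a file with an unexpected layout is not silently misread.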

    This also assumes the last comment line is exactly "*/". If there is the possibility of whitespace or other characters on that line, you could replace match("*/", lines) with grep("*/", lines, fixed = TRUE)[1], which will be a little bit slower.
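
    To illustrate the difference, here is a small sketch on a made-up header (the lines vector is invented for the example) in which the closing line carries extra whitespace. Trimming with trimws before matching is an additional option not mentioned above; it keeps the exact comparison while tolerating surrounding spaces:

    ```r
    # Made-up header whose closing comment line has surrounding whitespace.
    lines <- c("/* DATA DESCRIPTION:", "Parameter(s)", "  */  ",
               "Date/Time Pres Wind Temp")

    exact   <- match("*/", lines)                 # NA: no line is exactly "*/"
    loose   <- grep("*/", lines, fixed = TRUE)[1] # 3: first line containing "*/"
    trimmed <- match("*/", trimws(lines))         # 3: exact match after trimming
    ```

    Note that the grep version matches any line that contains "*/" anywhere, so it would also stop at an earlier line if the header text itself happened to include that sequence.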