
Time to count lines for a sequence of files using Rcpp is higher than expected


I have a process for cleaning files and then saving them into correctly formatted files that arrow can read later. The files are in TSV format and have around 30 columns of mixed data types -- mostly character, but a couple of numeric columns. A significant number of files have only a header and no data content. So, before reading a file in for cleaning, I decided to check that it has content, rather than reading it in as a data frame and then checking. Essentially, I just want to check that the number of lines in the file is >= 2. I am using a simple C++ function that I pulled into R with Rcpp:

Rcpp::cppFunction(
  "
  #include <fstream>
  // Return true as soon as a second line is read, so at most
  // the first two lines of the file are ever consumed.
  bool more_than_one_line(std::string filepath) {
    std::ifstream input_file;
    input_file.open(filepath);
    std::string unused;
    int numLines = 0;
    while(std::getline(input_file, unused)) {
      ++numLines;
      if (numLines >= 2) {
        return true;
      }
    }
    return false;
  }
  "
)

I take some timing measurements like so:

v <- vector(mode = "numeric", length = 1000)
ii <- 0
for (file in listOfFiles[1:1000]) {
  print(ii)
  ii <- ii + 1
  t0 <- Sys.time()
  more_than_one_line(file)
  v[ii] <- difftime(Sys.time(), t0)
}

When I run this code, it takes about 1 second per file if the files have never been read before; it is much, much faster over files that have previously been processed. Yet, according to this SO answer, the fastest time for counting the lines in a 12M-row file is 0.1 seconds (my files are at most 500k rows), and the SO user who recommended that fastest strategy (which used the Linux `wc` utility) also said that a C++ approach would be quite fast. I expected my C++ function to be at least as fast as `wc`, if not faster, especially since I read at most the first two lines of each file.

Am I thinking about this wrong? Is my approach wrong?


Solution

  • In my answer to the question linked to by the OP, I mention the package fpeek and its function peek_count_lines. This is a fast function coded in C++. With a directory of 82 CSV files ranging from 4.8K to 108K lines on my computer (a 1-year-old 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz running Windows 11), it averages 0.03 seconds per file to get the number of lines.
    Then you can use these values and subset on the condition flsize >= 2.

    #path <- "my/path/omited"
    fls <- list.files(path, full.names = TRUE)
    # how many files
    length(fls)
    #> [1] 82
    
    # range of file sizes in MB
    range(file.size(fls)) / 1024L / 1024L
    #> [1]  0.766923 19.999812
    # total file size in MB
    sum(file.size(fls)) / 1024L / 1024L
    #> [1] 485.6675
    
    # this is the main problem
    t0 <- system.time(
      flsize <- sapply(fls, fpeek::peek_count_lines)
    )
    
    # the files have from 4.8K to 108K lines
    range(flsize)
    #> [1]   4882 108503
    # how many files have more than just the header line
    sum(flsize >= 2)
    #> [1] 82
    
    # timings
    t0
    #>    user  system elapsed 
    #>    0.28    1.12    2.30
    # average timings per file
    t0/length(flsize)
    #>        user      system     elapsed 
    #> 0.003414634 0.013658537 0.028048780
    

    Created on 2023-02-04 with reprex v2.0.2