I have a process for cleaning files and then saving them into correctly formatted files that arrow can read at a later time. The files are in tsv format and have around 30 columns of mixed data types -- mostly character, but a couple of numeric columns. A significant number of the files contain only a header and no data rows. Rather than reading each file into a data frame and then checking for content, I decided to check before reading that the file has content; essentially, I just want to verify that the number of lines in the file is >= 2. I am using a simple C++ function that I pulled into R using Rcpp:
Rcpp::cppFunction(
  "
  #include <fstream>
  // returns TRUE if the file has at least two lines (header + at least one data row)
  bool more_than_one_line(std::string filepath) {
    std::ifstream input_file;
    input_file.open(filepath);
    std::string unused;
    int numLines = 0;
    // stop reading as soon as a second line is found
    while (std::getline(input_file, unused)) {
      ++numLines;
      if (numLines >= 2) {
        return true;
      }
    }
    return false;
  }
  "
)
I take some timing measurements like so:
v <- vector(mode = "numeric", length = 1000)
ii <- 0
for (file in listOfFiles[1:1000]) {
  print(ii)
  ii <- ii + 1
  t0 <- Sys.time()
  more_than_one_line(file)
  v[ii] <- difftime(Sys.time(), t0)
}
When I run this code, it takes about 1 second per file if the file has never been read before; it is much, much faster when I rerun it over files that have previously been processed. Yet, according to this SO answer, the fastest time for counting the lines in a 12M-row file is 0.1 seconds (my files are at most 500K rows), and the SO user who recommended that fastest strategy (which uses the Linux wc utility) also suggested that a C++ approach would be quite fast. I thought the C++ method I wrote would be at least as fast as the wc method, if not faster, given that I read at most the first two lines of each file.
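For reference, the wc strategy from that answer would look roughly like this when called from R (just a sketch of my understanding; it assumes a Unix-like system with wc on the PATH, and count_lines_wc is a name I made up):

count_lines_wc <- function(filepath) {
  # shell out to `wc -l`, which prints "<count> <filename>"
  out <- system2("wc", args = c("-l", shQuote(filepath)), stdout = TRUE)
  as.integer(strsplit(trimws(out), "\\s+")[[1]][1])
}
# example: does the first file have more than just the header?
count_lines_wc(listOfFiles[1]) >= 2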
Am I thinking about this wrong? Is my approach wrong?
In my answer to the question linked to by the OP, I mention package fpeek, function peek_count_lines. This is a fast function coded in C++. With a directory of 82 CSV files ranging from 4.8K to 108K lines, on my computer (a one-year-old 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz running Windows 11) it averages about 0.03 seconds per file to count the lines. Then you can use these line counts and subset on the condition flsize >= 2.
#path <- "my/path/omited"
fls <- list.files(path, full.names = TRUE)
# how many files
length(fls)
#> [1] 82
# range of file sizes in MB
range(file.size(fls)) / 1024L / 1024L
#> [1] 0.766923 19.999812
# total file size in MB
sum(file.size(fls)) / 1024L / 1024L
#> [1] 485.6675
# this is the main step: count the lines in every file
t0 <- system.time(
  flsize <- sapply(fls, fpeek::peek_count_lines)
)
# the files have from 4.8K to 108K lines
range(flsize)
#> [1] 4882 108503
# how many files have more than just the header line
sum(flsize >= 2)
#> [1] 82
# timings
t0
#>    user  system elapsed 
#>    0.28    1.12    2.30
# average timings per file
t0/length(flsize)
#>        user      system     elapsed 
#> 0.003414634 0.013658537 0.028048780
Created on 2023-02-04 with reprex v2.0.2
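The subsetting step mentioned above is then a one-liner (a sketch reusing the fls and flsize objects from the reprex):

# keep only the files that have at least one data row after the header
fls_with_data <- fls[flsize >= 2]
length(fls_with_data)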