r tidyverse readr

Import with read_csv without guessing column types in R


Is there a way to use read_csv from the readr package and not guess the column type?

The function documentation lists the argument guess_max = min(1000, n_max), which suggested to me that the default value of n_max (which is Inf) would be a viable value for guess_max. It wasn't: it froze the entire computer. No "R is not responding" message, no way to close the application, no mouse or keyboard response. I had to restart using the power button.

I tried high values for guess_max below Inf, but the higher the value, the slower the import gets. Right now I use the following code instead.

library(readr)  # read_csv(), cols()

# how many rows? (read everything as character so nothing is guessed here)
rowsInFile <- nrow(read_csv("sources/features.csv", col_types = cols(.default = "c")))

# ...use that to not guess
df <- read_csv("sources/features.csv", guess_max = rowsInFile)
rm(rowsInFile)

I.e. I import the file once to find out how many rows it has and then "guess" up to that row. But I feel like there's gotta be a better way. Anyone got an idea that will sound obvious to me once I read it?


Solution

  • If you don't care about performance, try this combination of length() and count.fields():

    length(count.fields("sources/features.csv", skip = 1))

    count.fields() counts the number of fields, as separated by sep, in each line of the file, so taking length() of that result effectively gives the number of lines read. With skip = 1 the header line is left out, so what remains is the number of data rows in the file.

    Uninvited comment: Correct me if I am wrong, but from what I understand you are setting guess_max to the number of rows because you assume that once R has seen every row of a column, it is no longer guessing the class of that column?

    I am not sure that is how it works. Even if R looks at all the rows, it is still, in the true sense of the word, guessing the class of that column.
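To illustrate the difference between "guessing from every row" and "not guessing at all", here is a minimal sketch. It assumes the goal is to skip type inference entirely by passing col_types, the same argument the question already uses for the row count. The temporary file and the column names id, score and label are made up for illustration and stand in for sources/features.csv.

```r
library(readr)

# Hypothetical sample file standing in for sources/features.csv
path <- tempfile(fileext = ".csv")
writeLines(c("id,score,label", "1,3.5,a", "2,NA,b", "3,7,c"), path)

# Row count without parsing the data: one entry per data line (header skipped)
n_rows <- length(count.fields(path, sep = ",", skip = 1))

# No guessing: every column is declared as character up front
all_chr <- read_csv(path, col_types = cols(.default = col_character()))

# Or declare the type of each (assumed) column explicitly
typed <- read_csv(
  path,
  col_types = cols(
    id    = col_integer(),
    score = col_double(),
    label = col_character()
  )
)
```

With col_types fully specified, guess_max should no longer matter, because there is nothing left for read_csv to infer. Reading everything as character and converting selected columns afterwards is one way to go when the real column types are not known in advance.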