r dataframe tidyverse readr

Column wrongly read as int when it is num


I have generated a dataframe that contains 9829 observations of 37 variables and saved it with write_csv.

When I load this dataframe in Shiny with read_csv, one column is typed as int even though its values are floating-point numbers. This causes all floating-point values in that column to be set to NA.

After closer investigation, the cause appears to be that the first ~4000 observations in that column are 0 with no decimal digits, which misleads the reading function when it guesses the column type.

A quick fix has been to sort the dataframe in descending order by the problem column before saving. But this is not a valid solution, as more than one column may have this issue in the future.

Question: Is there a way to make write_csv write all values in floating-point columns with 2 digits of precision, or to fix the issue automatically?

Thank you

EDIT

library(tidyverse)

col1 <- c(c(0:5000), c(2.1,3.5))
df <- data.frame(col1)

write_csv(df, "./data_out/test/wrong_dataType_issue.csv")
df_read <- read_csv("./data_out/test/wrong_dataType_issue.csv")
summary(df_read)

 col1     
 Min.   :   0  
 1st Qu.:1250  
 Median :2500  
 Mean   :2500  
 3rd Qu.:3749  
 Max.   :4999  
 NA's   :7     

Solution

  • By default, read_csv() looks only at the first 1,000 rows of data to guess each column's type. I suggest this chapter of R for Data Science for background. Because the guess is based on those rows alone, the function can guess incorrectly. For example, I once had a dataset where the column gender was marked as logical because the first 1,000 rows were all female, and the function interpreted "F" to mean "FALSE". There are two ways to fix this problem: the quick way and the right way.

    The quick way

    read_csv() has an argument called guess_max that sets how many rows to explore. You could use something like this as a hacky way to fix the problem...

    read_csv("my_data.csv", guess_max = 9829)
    

    That forces the read_csv() function to look at every value in your dataset before guessing the column types. It'll fix your problem but it might cause more trouble in the future, especially if embedded in a Shiny app where the underlying data might change.

    The right way

    read_csv() makes it easy to explicitly define the data types of all your columns. If you want to make sure that column age is always read as numeric, use something like the following...

    read_csv("my_data.csv", col_types = cols(age = col_double()))
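
To see both fixes side by side, here is a minimal sketch that rebuilds the example from the question's EDIT in a temporary file. The tempfile() path and the names df_guess and df_typed are only for illustration, and the column is assumed to be called col1 as in that example. Recent readr versions may already guess double for this file, so a plain read_csv() call may not reproduce the NAs on your machine.

```r
library(readr)

# Illustrative data modelled on the question's EDIT: whole numbers first,
# then two fractional values at the very end of the column.
df <- data.frame(col1 = c(0:5000, 2.1, 3.5))

# Write to a temporary file so the sketch does not depend on any project path.
path <- tempfile(fileext = ".csv")
write_csv(df, path)

# Quick way: let readr inspect every row before guessing the column types,
# without hard-coding the number of rows.
df_guess <- read_csv(path, guess_max = Inf)

# Right way: declare the type of the column explicitly, so no guessing
# is involved for col1.
df_typed <- read_csv(path, col_types = cols(col1 = col_double()))

# Inspect the result: col1 should be a double column and the fractional
# values at the end of the file should have been kept.
str(df_typed$col1)
anyNA(df_typed$col1)
tail(df_typed$col1, 2)
```

Declaring col_types keeps working when the file grows or the early rows change, which is why it is the safer choice inside a Shiny app.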