I have generated a dataframe that contains 9829 observations of 37 variables and saved it with write_csv.
When I load this dataframe in Shiny with read_csv, one column is typed as int even though it contains floating-point values, which causes all floating-point values in that column to be set to NA.
After close investigation, the cause appears to be that the first ~4000 observations in that column are 0 with no decimal digits, which the reading function seems to handle badly.
A quick fix has been to sort the dataframe in descending order by the problem column before saving. But this is not a valid solution, as I may have more than one column with this issue in the future.
Question: Is there a way to make write_csv write all values in floating-point columns with 2 digits of precision? Or to fix the issue automatically?
Thank you
EDIT
library(tidyverse)
col1 <- c(c(0:5000), c(2.1,3.5))
df <- data.frame(col1)
write_csv(df, "./data_out/test/wrong_dataType_issue.csv")
df_read <- read_csv("./data_out/test/wrong_dataType_issue.csv")
summary(df_read)
col1
Min. : 0
1st Qu.:1250
Median :2500
Mean :2500
3rd Qu.:3749
Max. :4999
NA's :7
By default, read_csv() looks at the first 1,000 rows of data to guess each column's type. I suggest this chapter of R for Data Science for background. It's possible for the function to guess incorrectly. For example, I once had a dataset where the column gender was marked as logical because the first 1,000 rows were all female, and the function interpreted "F" to mean "FALSE". There's the right way to fix this problem and the quick way.
The quick way
read_csv() has an argument called guess_max that sets how many rows to explore. You could use something like this as a hacky way to fix the problem...
read_csv("my_data.csv", guess_max = 9829)
That forces the read_csv() function to look at every value in your dataset before guessing the column types. It'll fix your problem but it might cause more trouble in the future, especially if embedded in a Shiny app where the underlying data might change.
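If you do go the quick way, one option is to derive guess_max from the file itself instead of hard-coding 9829, so the call keeps working when the row count changes. This is only a sketch built on the example data from the question; the temp file path and the n_lines variable are assumptions made for illustration.

```r
library(readr)

# Hypothetical temp file standing in for the real CSV
path <- tempfile(fileext = ".csv")
df <- data.frame(col1 = c(0:5000, 2.1, 3.5))
write_csv(df, path)

# Count the lines in the file (header included) so that every
# row is inspected before the column types are guessed
n_lines <- length(readLines(path))
df_read <- read_csv(path, guess_max = n_lines)

# The two fractional values at the end should survive
tail(df_read$col1, 2)
```

Reading the file twice costs time on large inputs, so this is still a stopgap rather than a real fix.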
The right way
read_csv() makes it easy to explicitly define the data types of all your columns. If you want to make sure that column age is always read as numeric, use something like the following...
read_csv("my_data.csv", col_types = cols(age = col_double()))
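Since you mentioned that more than one column might be affected, here is a sketch of the same idea with a reusable column specification. The file path and the name column are hypothetical, and col1 is taken from the example in the question; swap in your real column names.

```r
library(readr)

# Hypothetical two-column file; `name` is an illustrative column
path <- tempfile(fileext = ".csv")
df <- data.frame(
  name = rep("a", 5003),
  col1 = c(0:5000, 2.1, 3.5)
)
write_csv(df, path)

# Declare the type of every column up front so nothing is guessed
spec <- cols(
  name = col_character(),
  col1 = col_double()
)
df_typed <- read_csv(path, col_types = spec)

# Count the values that failed to parse in the numeric column
sum(is.na(df_typed$col1))
```

Keeping the specification in one object means the Shiny app and any other script can share it, and a change in the data will surface as a parsing problem instead of silent NA values.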