I'm trying to read in a small (17 kB), simple CSV file from EdX.org (for an online course), and I've never had this trouble with readr::read_csv() before. Base R's read.csv() reads the file without generating the problem.
library(tidyverse)
df <- read_csv("https://courses.edx.org/assets/courseware/v1/ccdc87b80d92a9c24de2f04daec5bb58/asset-v1:MITx+15.071x+1T2020+type@asset+block/WHO.csv")
head(df)
This gives the following output:
#> # A tibble: 6 x 13
#> Country Region Population Under15 Over60 FertilityRate LifeExpectancy
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 Afghan… Easte… 29825 47.4 3.82 "\r5.4\r" 60
#> 2 Albania Europe 3162 21.3 14.9 "\r1.75\r" 74
#> 3 Algeria Africa 38482 27.4 7.17 "\r2.83\r" 73
#> 4 Andorra Europe 78 15.2 22.9 <NA> 82
#> 5 Angola Africa 20821 47.6 3.84 "\r6.1\r" 51
#> 6 Antigu… Ameri… 89 26.0 12.4 "\r2.12\r" 75
#> # … with 6 more variables: ChildMortality <dbl>, CellularSubscribers <dbl>,
#> # LiteracyRate <chr>, GNI <chr>, PrimarySchoolEnrollmentMale <chr>,
#> # PrimarySchoolEnrollmentFemale <chr>
You'll notice that the column FertilityRate has "\r" added to the values. I've downloaded the CSV file and cannot find these characters there. Base R's read.csv() reads in the file with no problems, so I'm wondering what the problem is with my usage of the tidyverse read_csv().
head(df$FertilityRate)
#> [1] "\r5.4\r" "\r1.75\r" "\r2.83\r" NA "\r6.1\r" "\r2.12\r"
How can I fix my usage of read_csv() so that the "\r" strings are not there?
If possible, I'd prefer not to have to individually specify the type of every single column.
In a nutshell, the characters are inside the file (probably by accident), and read_csv is right not to remove them automatically: since they occur within quotes, by convention a CSV parser should treat the field as-is and not strip out whitespace characters. read.csv is wrong to do so, and this is arguably a bug.
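A text editor may not display carriage returns, so one way to check whether they are really in a file is to count the raw bytes. The following is only a sketch: it writes a tiny hypothetical stand-in file (not the actual WHO.csv) containing one quoted field wrapped in "\r", then counts the carriage-return bytes (0x0d). Pointing `path` at the downloaded file instead should work the same way.

```r
# Hypothetical two-line CSV standing in for the real file:
# the quoted FertilityRate field is wrapped in carriage returns.
path <- tempfile(fileext = ".csv")
writeBin(charToRaw("Country,FertilityRate\nAfghanistan,\"\r5.4\r\"\n"), path)

# Read the file as raw bytes so nothing is translated or stripped.
bytes <- readBin(path, what = "raw", n = file.size(path))

# Count carriage-return bytes (0x0d); any value above 0 means they are in the file.
n_cr <- sum(bytes == as.raw(0x0d))
n_cr
```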
You can strip them out yourself once you’ve loaded the data:
df = mutate_if(df, is.character, ~ stringr::str_remove_all(.x, '\r'))
This seems to be good enough for this file, but in general I’d be wary that the file might be damaged in other ways, since the presence of these characters is clearly not intentional, and the file follows no common line ending convention (it’s neither a conventional Windows nor Unix file).
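Note that after stripping the "\r" characters, the affected columns are still of type character. If you want them to be numeric without specifying every column type by hand, one option (a sketch, not the only way) is to let readr re-guess the types with type_convert(). The small tibble below is a made-up stand-in for the parsed data, and `cleaned` is just an illustrative name.

```r
library(dplyr)
library(stringr)
library(readr)

# Made-up stand-in for the data as parsed by read_csv():
# FertilityRate came in as character because of the stray "\r".
df <- tibble(
  Country = c("Afghanistan", "Albania", "Andorra"),
  FertilityRate = c("\r5.4\r", "\r1.75\r", NA)
)

cleaned <- df %>%
  # Remove carriage returns from every character column.
  mutate(across(where(is.character), ~ str_remove_all(.x, "\r"))) %>%
  # Re-run readr's type guessing on the character columns.
  type_convert()

str(cleaned)
```

Here across(where(is.character), ...) is the current dplyr equivalent of the mutate_if() call above. type_convert() only touches character columns, so columns that were already parsed as numbers are left alone.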