Tags: r, csv, tidyverse, readr

read_csv() adds "\r" to *.csv input


I'm trying to read a small (17 KB), simple CSV file from edX.org (for an online course) with readr::read_csv(), and I've never had this trouble before. Base R's read.csv() reads the same file without the problem.

A small (17kb) csv file from EdX.org

library(tidyverse)
df <- read_csv("https://courses.edx.org/assets/courseware/v1/ccdc87b80d92a9c24de2f04daec5bb58/asset-v1:MITx+15.071x+1T2020+type@asset+block/WHO.csv")
head(df)

This gives the following output:

#> # A tibble: 6 x 13
#>   Country Region Population Under15 Over60 FertilityRate LifeExpectancy
#>   <chr>   <chr>       <dbl>   <dbl>  <dbl> <chr>                  <dbl>
#> 1 Afghan… Easte…      29825    47.4   3.82 "\r5.4\r"                 60
#> 2 Albania Europe       3162    21.3  14.9  "\r1.75\r"                74
#> 3 Algeria Africa      38482    27.4   7.17 "\r2.83\r"                73
#> 4 Andorra Europe         78    15.2  22.9  <NA>                      82
#> 5 Angola  Africa      20821    47.6   3.84 "\r6.1\r"                 51
#> 6 Antigu… Ameri…         89    26.0  12.4  "\r2.12\r"                75
#> # … with 6 more variables: ChildMortality <dbl>, CellularSubscribers <dbl>,
#> #   LiteracyRate <chr>, GNI <chr>, PrimarySchoolEnrollmentMale <chr>,
#> #   PrimarySchoolEnrollmentFemale <chr>

You'll notice that the column FertilityRate has "\r" added to the values. I've downloaded the csv file and cannot find them there.

Base-R read.csv() reads in the file with no problems, so I'm wondering what the problem is with my usage of the tidyverse read_csv().

head(df$FertilityRate)
#> [1] "\r5.4\r"  "\r1.75\r" "\r2.83\r" NA         "\r6.1\r"  "\r2.12\r"

How can I change my usage of read_csv() so that the "\r" characters are not there?

If possible, I'd prefer not to have to individually specify the type of every single column.


Solution

  • In a nutshell, the characters are inside the file (probably by accident), and read_csv is right not to remove them automatically: because they occur within quotes, CSV convention says a parser should keep the field as-is and not strip whitespace characters from it. read.csv is wrong to strip them, and this is arguably a bug.

    You can strip them out yourself once you’ve loaded the data:

    df = mutate_if(df, is.character, ~ stringr::str_remove_all(.x, '\r'))
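
    After stripping, the affected columns are still character columns. A minimal sketch of one way to restore numeric types without listing every column, assuming a small hypothetical tibble in place of the real WHO data and that readr::type_convert() guesses the types as it would on a fresh read:

    ```r
    library(dplyr)

    # Hypothetical stand-in for the affected columns of WHO.csv.
    df <- tibble(
      Country       = c("Afghanistan", "Albania", "Andorra"),
      FertilityRate = c("\r5.4\r", "\r1.75\r", NA)
    )

    # Strip carriage returns from every character column
    # (same idea as the mutate_if() call above, written with across()).
    df <- mutate(df, across(where(is.character), ~ stringr::str_remove_all(.x, "\r")))

    # Re-guess column types now that the values are clean;
    # FertilityRate should come back as a double column.
    df <- readr::type_convert(df)
    ```

    Columns that contain genuine text, such as Country, stay as character.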
    

    This seems to be good enough for this file, but in general I’d be wary that the file might be damaged in other ways, since the presence of these characters is clearly not intentional, and the file follows no common line ending convention (it’s neither a conventional Windows nor a conventional Unix file).
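
    One way to check the line endings yourself is to count the raw bytes. A minimal sketch, assuming a small hypothetical file written to a temporary path in place of the downloaded WHO.csv (point `path` at the real download to inspect it):

    ```r
    # Hypothetical stand-in for the downloaded file: quoted fields wrapped in "\r",
    # rows terminated by a plain "\n". Written in binary mode so nothing is translated.
    path <- tempfile(fileext = ".csv")
    writeBin(
      charToRaw('Country,FertilityRate\nAlbania,"\r1.75\r"\nAngola,"\r6.1\r"\n'),
      path
    )

    # Read the file back as raw bytes.
    bytes <- readBin(path, what = "raw", n = file.size(path))

    cr <- as.raw(0x0d)  # carriage return, "\r"
    lf <- as.raw(0x0a)  # line feed, "\n"

    n_cr   <- sum(bytes == cr)
    n_lf   <- sum(bytes == lf)
    # Number of "\r" bytes immediately followed by "\n" (Windows-style line endings).
    n_crlf <- sum(head(bytes, -1) == cr & tail(bytes, -1) == lf)

    # Carriage returns that are not part of a "\r\n" pair.
    n_stray_cr <- n_cr - n_crlf
    ```

    A conventional Unix file typically has no "\r" bytes at all, and in a conventional Windows file every "\r" is part of a "\r\n" pair, so a non-zero `n_stray_cr` points to carriage returns sitting inside fields.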