r dataframe csv import export-to-csv

R: why does read.table turn character strings into numeric by dropping the trailing period, and how can I avoid it?


I have a data frame that I want to export to CSV and then re-import into a data frame. On import, one column is corrupted: the period at the end of each string is dropped and the values are interpreted as numeric.

Here is a minimal example:

df <- data.frame(integers = c(1:8, NA, 10L),
                 doubles  = as.numeric(paste0(c(1:7, NA, 9, 10), ".1")),
                 strings = paste0(c(1:10),".")
                 )
df
str(df) # here the last column is "chr"

write.table(df,
            file = "df.csv",
            sep = "\t",
            na = "NA",
            row.names = FALSE,
            col.names = TRUE,
            fileEncoding = "UTF-8"
)

df <- read.table(file = "df.csv",
                 header = TRUE,
                 sep = "\t",
                 na.strings = "NA",
                 quote="\"",
                 fileEncoding = "UTF-8"
                 )
df
str(df)  # here the last column is "num"

Solution

  • With read.table, we can set colClasses explicitly for each column. The accepted atomic modes are listed in ?vector:

    The atomic modes are "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw".

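    The issue is that, when colClasses is not specified, read.table uses type.convert to guess the type of each column. Strings such as "1." are valid numeric literals, so the whole column is converted to numeric. From ?read.table: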

    Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer, numeric, complex or (depending on as.is) factor as appropriate.

    The relevant code in read.table is:

    ...
         do[1L] <- FALSE
        for (i in (1L:cols)[do]) {
            data[[i]] <- if (is.na(colClasses[i])) 
                type.convert(data[[i]], as.is = as.is[i], dec = dec, 
                    numerals = numerals, na.strings = character(0L))
            else if (colClasses[i] == "factor") 
                as.factor(data[[i]])
            else if (colClasses[i] == "Date") 
                as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") 
                as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
        }
    ...
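
The effect of that type.convert call can be reproduced on its own, without any file. This is a minimal sketch: the variable names (raw_strings, converted, mixed) are made up for illustration, and the "v3." value is an invented example of a string that is not a valid number.

```r
# The strings column as read.table holds it before conversion: plain text.
raw_strings <- paste0(1:10, ".")

# type.convert() guesses the type; "1." parses as a number, so the
# result is a numeric vector and the trailing period is gone.
converted <- type.convert(raw_strings, as.is = TRUE)
class(converted)

# If any value is not a valid number, the column stays character.
# "v3." is a hypothetical value used only to show this.
mixed <- type.convert(c("1.", "2.", "v3."), as.is = TRUE)
class(mixed)
```

So the conversion depends only on whether every value in the column happens to look like a number, which is why setting the class up front is the more predictable option.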
    
    df <- read.table(file = "df.csv",
                     header = TRUE,
                     sep = "\t",
                     na.strings = "NA",
                     quote="\"",
                     fileEncoding = "UTF-8", 
                     colClasses = c("integer", "numeric", "character")
                     )
    

    Checking the structure:

    str(df)
    'data.frame':   10 obs. of  3 variables:
     $ integers: int  1 2 3 4 5 6 7 8 NA 10
     $ doubles : num  1.1 2.1 3.1 4.1 5.1 6.1 7.1 NA 9.1 10.1
     $ strings : chr  "1." "2." "3." "4." ...
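
If only one column needs protecting, colClasses can also be given as a named vector, in which case the columns that are not named are still guessed by type.convert. The round trip below is a sketch under a few assumptions: it writes to a temporary file instead of df.csv, reads into a new object called df2, and sets stringsAsFactors = FALSE explicitly so that it behaves the same on older R versions.

```r
# Same data as in the question.
df <- data.frame(integers = c(1:8, NA, 10L),
                 doubles  = as.numeric(paste0(c(1:7, NA, 9, 10), ".1")),
                 strings  = paste0(1:10, "."),
                 stringsAsFactors = FALSE)

# Temporary file used only for this sketch (the question uses "df.csv").
path <- tempfile(fileext = ".csv")

write.table(df, file = path, sep = "\t", na = "NA",
            row.names = FALSE, col.names = TRUE,
            fileEncoding = "UTF-8")

# Name only the column that must stay character; the other columns
# are left for read.table to convert automatically.
df2 <- read.table(file = path, header = TRUE, sep = "\t",
                  na.strings = "NA", quote = "\"",
                  fileEncoding = "UTF-8",
                  colClasses = c(strings = "character"))
str(df2)
```

Naming the column avoids having to list a class for every column, which helps when the file has many columns and only a few are at risk of being misread.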