Search code examples
rdata.tablegsubfread

Replacing dots, commas and percentages marks with load `csv` file with `fread()`


We would like to manage the columns from csv file with originally three character columns class when we used the fread() with the arguments detailed in the R code used section (separator and decimal arguments are specified). The R session version is 4.2.0 and the data.table version is 1.14.2.

Input data from csv file


col_1,col_2, col_3
/100.432,"30,84 %","4,14"
/3.200,"62,89 %","1,89"
/10.100,"50,00 %","1,62"
/15.570, "40,10 %","3,41"
/900.310, "8,00 %","0,10"

Input data in R session

> dat
# A tibble: 5 × 3
 
  col_1   col_2    col_3
  <chr>  <chr>   <chr>
1 100.432 30,84 % 4,14 
2   3.200 62,89 % 1,89 
3  10.100 50,00 % 1,62 
4  15.570 40,10 % 3,41 
5 900.310  8,00 % 0,10

R code used


data.table::fread(
  x,
  sep = ',',
  dec = '.',
  na.strings = c('', 'NA')) %>%
as_tibble()

Desired output data


> dat
# A tibble: 5 × 3
 
 col_1   col_2  col_3
  <dbl>  <dbl>  <dbl>
1 100438 30.84  4.14 
2   3200 62.89  1.89 
3  10100 50.00  1.62 
4  15570 40.10  3.41 
5 900310  8.00  0.10

Question

We would like to obtain the Desired output data format.

Thanks in advance


Solution

  • You could just do some postprocessing in R:

    RAW = fread('col_1,col_2, col_3
    /100.432,"30,84Â %","4,14"
    /3.200,"62,89Â %","1,89"
    /10.100,"50,00Â %","1,62"
    /15.570, "40,10Â %","3,41"
    /900.310, "8,00Â %","0,10"')
    
    #       col_1    col_2  col_3
    #      <char>   <char> <char>
    # 1: /100.432 30,84Â %   4,14
    # 2:   /3.200 62,89Â %   1,89
    # 3:  /10.100 50,00Â %   1,62
    # 4:  /15.570 40,10Â %   3,41
    # 5: /900.310  8,00Â %   0,10
    
    library(readr)
    RAW[, lapply(.SD, \(x) parse_number(x, locale = locale(decimal_mark = ",")))]
    # Or in base R:
    RAW[, lapply(.SD, \(x) gsub("[^0-9.]", "", chartr(".,", "_.", x)) |> as.numeric())]
    
    #     col_1 col_2 col_3
    #     <num> <num> <num>
    # 1: 100432 30.84  4.14
    # 2:   3200 62.89  1.89
    # 3:  10100 50.00  1.62
    # 4:  15570 40.10  3.41
    # 5: 900310  8.00  0.10