Tags: r, apache-spark, sparklyr

R: Read a CSV with comma decimal separators using the sparklyr package


I need to read a ".csv" file using the "sparklyr" library, in which the numeric values use a comma as the decimal separator. The idea is to be able to read it directly with "spark_read_csv()".

I am using:

library(sparklyr)
library(dplyr)

f <- data.frame(DNI = c("22-e", "EE-4", "55-W"),
                DD  = c("33,2", "33.2", "14,55"),
                CC  = c("2", "44,4", "44,9"))

write.csv(f,"aff.csv")

sc <- spark_connect(master = "local", spark_home = "/home/tomas/spark-2.1.0-bin-hadoop2.7/", version = "2.1.0")

df <- spark_read_csv(sc, name = "data", path = "/home/tomas/Documentos/Clusterapp/aff.csv", header = TRUE, delimiter = ",")

tbl <- sdf_copy_to(sc = sc, x = df, overwrite = TRUE)

The problem: the numbers are read in as factors (strings) instead of numerics.


Solution

  • To manipulate strings inside a Spark DataFrame you can use the regexp_replace function, as described here:

    https://spark.rstudio.com/guides/textmining/

    For your problem it would work like this:

    tbl <- sdf_copy_to(sc = sc, x = df, overwrite = TRUE)
    
    tbl0 <- tbl %>%
        mutate(DD = regexp_replace(DD, ",", "."),
               CC = regexp_replace(CC, ",", ".")) %>%
        mutate_at(vars(c("DD", "CC")), as.numeric)
    

    To check your result:

    > glimpse(tbl0)
    Observations: ??
    Variables: 3
    $ DNI <chr> "22-e", "EE-4", "55-W"
    $ DD  <dbl> 33.20, 33.20, 14.55
    $ CC  <dbl> 2.0, 44.4, 44.9
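If you want to sanity-check the conversion locally before pushing the data to Spark, plain base R does the same comma-to-dot replacement with gsub followed by as.numeric. This is just a local illustration of the transformation the Spark-side regexp_replace performs; the helper name to_num is my own, not part of any package:

```r
# Local base-R equivalent of the Spark-side regexp_replace step:
# swap the decimal comma for a dot, then coerce to numeric.
f <- data.frame(DNI = c("22-e", "EE-4", "55-W"),
                DD  = c("33,2", "33.2", "14,55"),
                CC  = c("2", "44,4", "44,9"),
                stringsAsFactors = FALSE)

# Hypothetical helper: fixed = TRUE treats "," as a literal, not a regex.
to_num <- function(x) as.numeric(gsub(",", ".", x, fixed = TRUE))

f$DD <- to_num(f$DD)
f$CC <- to_num(f$CC)

str(f)
```

Values that already use a dot (like "33.2") pass through unchanged, so the function is safe to apply to mixed columns like DD above.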