
Sparklyr - Decimal precision 8 exceeds max precision 7


I'm trying to copy a big database into Spark using spark_read_csv, but I'm getting the following error as output:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 4 times, most recent failure: Lost task 0.3 in stage 16.0 (TID 176, 10.1.2.235): java.lang.IllegalArgumentException: requirement failed: Decimal precision 8 exceeds max precision 7

data_tbl <- spark_read_csv(sc, "data", "D:/base_csv", delimiter = "|", overwrite = TRUE)

It's a big data set, about 5.8 million records; the columns in my dataset are of types int, num, and chr.


Solution

  • I think you have a couple of options, depending on the Spark version you're using.

    Spark >=1.6.1

    From here: https://docs.databricks.com/spark/latest/sparkr/functions/read.df.html it seems you can explicitly specify your schema and force the problem column to be read as a double, so it isn't limited by the inferred decimal precision (a sparklyr equivalent is sketched at the end of this answer):

    # Declare the schema explicitly so "carat" is read as a double rather than an inferred decimal
    csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
    diamondsLoadWithSchema <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                                      source = "csv", header = "true", schema = csvSchema)
    

    Spark < 1.6.1

    Consider this test.csv:

    1,a,4.1234567890
    2,b,9.0987654321
    

    You can easily make this more efficient, but I think you get the gist:

    # Split each raw text line on commas
    linesplit <- function(x){
      tmp <- strsplit(x, ",")
      return(tmp)
    }

    # Convert the split fields to the desired types: integer, character, double
    lineconvert <- function(x){
      arow <- x[[1]]
      converted <- list(as.integer(arow[1]), as.character(arow[2]), as.double(arow[3]))
      return(converted)
    }

    # Read the file as an RDD of lines, parse and convert, then build a DataFrame
    rdd <- SparkR:::textFile(sc, '/path/to/test.csv')
    lnspl <- SparkR:::map(rdd, linesplit)
    ll2 <- SparkR:::map(lnspl, lineconvert)
    ddf <- createDataFrame(sqlContext, ll2)
    head(ddf)
    
      _1 _2           _3
    1  1  a 4.1234567890
    2  2  b 9.0987654321
    

    NOTE: the SparkR::: methods are private for a reason; the docs say 'be careful when you use this'.
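
    Since the original question uses sparklyr rather than SparkR, a rough equivalent of the explicit-schema approach is spark_read_csv's columns argument combined with infer_schema = FALSE, so the declared types are used instead of inferred ones. This is only a sketch: the column names and types below (id, label, value) are placeholders and would need to match the actual file.

    # Hypothetical column names/types -- replace with the real ones from your file
    col_types <- c(id = "integer", label = "character", value = "double")

    data_tbl <- spark_read_csv(sc, "data", "D:/base_csv",
                               delimiter = "|",
                               columns = col_types,   # declare types up front
                               infer_schema = FALSE,  # don't let Spark guess a narrow decimal
                               overwrite = TRUE)

    Declaring the types up front should avoid the situation where Spark infers a decimal type from the first rows it samples and then fails when a wider value appears later.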