Search code examples
scalaapache-sparkapache-spark-dataset

Cannot up cast `ordId` from string to int as it may truncate


I am trying to read a small file as a Dataset but its giving the error

"Cannot up cast ordId from string to int as it may truncate".

Here is the code:

object Main {
  case class Orders(ordId: Int, custId: Int, amount: Float, date: String)

  def main(args : Array[String]): Unit ={

    val schema = Encoders.product[Orders].schema

      val spark = SparkSession.builder
        .master ("local[*]")
        .appName ("")
        .getOrCreate ()


    val df = spark.read.option("header",true).csv("/mnt/data/orders.txt")


    import spark.implicits._
    val ds = df.as[Orders]

  }
}

orders.txt

ordId,custId,amount,date
1234,123,400,20190112
2345,456,600,20190122
1345,123,500,20190123
3456,345,800,20190202
5678,123,600,20190203
6578,455,900,20190301

How can I resolve this error?. Also I would like to know that do I first need to read the file as a Dataframe and then convert to a Dataset?


Solution

  • Try by passing schema(using .schema) while reading as DataFrame.

    import org.apache.spark.sql.Encoders
    val schema = Encoders.product[Orders].schema
    val ds=spark.read.option("header",true).schema(schema).csv("/mnt/data/orders.txt").as[Orders]
    ds.show()
    

    Result:

    +-----+------+------+--------+
    |ordId|custId|amount|    date|
    +-----+------+------+--------+
    | 1234|   123| 400.0|20190112|
    | 2345|   456| 600.0|20190122|
    | 1345|   123| 500.0|20190123|
    | 3456|   345| 800.0|20190202|
    | 5678|   123| 600.0|20190203|
    | 6578|   455| 900.0|20190301|
    +-----+------+------+--------+
    

    Schema:

    ds.printSchema()
    
    root
     |-- ordId: integer (nullable = true)
     |-- custId: integer (nullable = true)
     |-- amount: float (nullable = true)
     |-- date: string (nullable = true)
    

    Update:

    There are multiple ways to get month info extracted from date column

    1. Using unix_timestamp,from_unixtime functions:
    ds.withColumn("mnth",from_unixtime(unix_timestamp($"date","yyyyMMdd"),"MMM")).show()
    

    (or)

    1. Using to_date,date_format functions:
    ds.withColumn("mnth",date_format(to_date($"date","yyyyMMdd"),"MMM")).show()
    

    Result:

    +-----+------+------+--------+----+
    |ordId|custId|amount|    date|mnth|
    +-----+------+------+--------+----+
    | 1234|   123| 400.0|20190112| Jan|
    | 2345|   456| 600.0|20190122| Jan|
    | 1345|   123| 500.0|20190123| Jan|
    | 3456|   345| 800.0|20190202| Feb|
    | 5678|   123| 600.0|20190203| Feb|
    | 6578|   455| 900.0|20190301| Mar|
    +-----+------+------+--------+----+