Search code examples
scalaapache-sparkdataframeapache-spark-sqlspark-shell

convert scientific notation in string format to numeric in spark dataframe


Day_Date,timeofday_desc,Timeofday_hour,Timeofday_minute,Timeofday_second,value
2017-12-18,12:21:02 AM,0,21,2,“1.779209040E+08”
2017-12-19,12:21:02 AM,0,21,2,“1.779209040E+08”
2017-12-20,12:30:52 AM,0,30,52,“1.779209040E+08”
2017-12-21,12:30:52 AM,0,30,52,“1.779209040E+08”
2017-12-22,12:47:10 AM,0,47,10,“1.779209040E+08”
2017-12-23,12:47:10 AM,0,47,10,“1.779209040E+08”
2017-12-24,02:46:59 AM,2,46,59,“1.779209040E+08”
2017-12-25,02:46:59 AM,2,46,59,“1.779209040E+08”
2017-12-26,03:10:27 AM,3,10,27,“1.779209040E+08”
2017-12-27,03:10:27 AM,3,10,27,“1.779209040E+08”
2017-12-28,03:52:08 AM,3,52,8,“1.779209040E+08”

I am trying to convert value column to 177920904

val df1 = df.withColumn("s", 'value.cast("Decimal(10,4)")).drop("value").withColumnRenamed("s", "value")

also tried casting value as Float, Double. Always get null as output

df1.select("value").show()


+-----------+
|   value   |
+-----------+
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|

df.printSchema

root
 |-- Day_Date: string (nullable = true)
 |-- timeofday_desc: string (nullable = true)
 |-- Timeofday_hour: string (nullable = true)
 |-- Timeofday_minute: string (nullable = true)
 |-- Timeofday_second: string (nullable = true)
 |-- value: string (nullable = true)

Solution

  • Just need to cast it to decimal with enough room to fit the number.

    Decimal is Decimal(precision, scale), so Decimal(10, 4) means 10 digits in total, 6 at the left of the dot, and 4 to the right, so the number does not fit in your Decimal type.

    From the documentation

    precision represents the total number of digits that can be represented

    scale represents the number of fractional digits. This value must be less than or equal to precision. A scale of 0 produces integral values, with no fractional part

    Since you don't want any number to the right, you can try this

    df.withColumn("s", 'value.cast("Decimal(10,0)"))
    

    If you want to keep 4 decimal digits, you can just change it to

    df.withColumn("s", 'value.cast("Decimal(14,4)"))
    

    INPUT

    df.show
    +---------------+
    |          value|
    +---------------+
    |1.779209040E+08|
    +---------------+
    

    OUTPUT

    scala> df.withColumn("s", 'value.cast("Decimal(10,0)")).show
    +---------------+---------+
    |          value|        s|
    +---------------+---------+
    |1.779209040E+08|177920904|
    +---------------+---------+
    

    FULL SOLUTION

    Without dropping nor renamig

    val df1 = df.withColumn("value", 'value.cast("Decimal(10,0)"))
    

    FIX INPUT DATA

    As I said in the comment, the problem is that your numbers contain some weird characters around them, you should remove it before casting

    ORIGINAL

    scala> df.show
    +----------+--------------+--------------+----------------+----------------+-----------------+
    |  Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second|            value|
    +----------+--------------+--------------+----------------+----------------+-----------------+
    |2017-12-18|   12:21:02 AM|             0|              21|               2| ?1.779209040E+08|
    |2017-12-19|   12:21:02 AM|             0|              21|               2|?1.779209040E+08?|
    |2017-12-20|   12:30:52 AM|             0|              30|              52| ?1.779209040E+08|
    |2017-12-21|   12:30:52 AM|             0|              30|              52| ?1.779209040E+08|
    |2017-12-22|   12:47:10 AM|             0|              47|              10| ?1.779209040E+08|
    |2017-12-23|   12:47:10 AM|             0|              47|              10| ?1.779209040E+08|
    |2017-12-24|   02:46:59 AM|             2|              46|              59| ?1.779209040E+08|
    |2017-12-25|   02:46:59 AM|             2|              46|              59| ?1.779209040E+08|
    |2017-12-26|   03:10:27 AM|             3|              10|              27| ?1.779209040E+08|
    |2017-12-27|   03:10:27 AM|             3|              10|              27| ?1.779209040E+08|
    |2017-12-28|   03:52:08 AM|             3|              52|               8| ?1.779209040E+08|
    +----------+--------------+--------------+----------------+----------------+-----------------+
    

    There are many ways to remove them, a quick one is with an UDF and a regular expression to remove everything but numbers, letters, dot, + and -

     def clean(input: String) = input.replaceAll("[^a-zA-Z0-9\\+\\.-]", "")
     val cleanUDF = udf(clean _ )
    df.withColumn("value", cleanUDF($"value").cast(DecimalType(10,0))).show
    +----------+--------------+--------------+----------------+----------------+---------+
    |  Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second|    value|
    +----------+--------------+--------------+----------------+----------------+---------+
    |2017-12-18|   12:21:02 AM|             0|              21|               2|177920904|
    |2017-12-19|   12:21:02 AM|             0|              21|               2|177920904|
    |2017-12-20|   12:30:52 AM|             0|              30|              52|177920904|
    |2017-12-21|   12:30:52 AM|             0|              30|              52|177920904|
    |2017-12-22|   12:47:10 AM|             0|              47|              10|177920904|
    |2017-12-23|   12:47:10 AM|             0|              47|              10|177920904|
    |2017-12-24|   02:46:59 AM|             2|              46|              59|177920904|
    |2017-12-25|   02:46:59 AM|             2|              46|              59|177920904|
    |2017-12-26|   03:10:27 AM|             3|              10|              27|177920904|
    |2017-12-27|   03:10:27 AM|             3|              10|              27|177920904|
    |2017-12-28|   03:52:08 AM|             3|              52|               8|177920904|
    +----------+--------------+--------------+----------------+----------------+---------+