
Create column of decimal type when creating a dataframe


I would like to provide numbers when creating a Spark dataframe, but I am having issues with decimal-type numbers.

This way the number gets truncated:

import pyspark.sql.functions as F

df = spark.createDataFrame([(10234567891023456789.5, )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+---------------------+----------------------+
#|numb                 |numb_dec              |
#+---------------------+----------------------+
#|1.0234567891023456E19|10234567891023456000.0|
#+---------------------+----------------------+

This fails:

df = spark.createDataFrame([(10234567891023456789.5, )], "numb decimal(30,1)")
df.show(truncate=False)

TypeError: field numb: DecimalType(30,1) can not accept object 1.0234567891023456e+19 in type <class 'float'>

How can I correctly provide big decimal numbers so that they don't get truncated?


Solution

  • This is caused by the limited precision of Python floats: a float is a 64-bit IEEE 754 double with only about 15–17 significant decimal digits, so the literal is already rounded before Spark ever receives it (the error message above already shows 1.0234567891023456e+19). You can pass string values when creating the dataframe instead; an alternative using decimal.Decimal is sketched after this example:

    df = spark.createDataFrame([("10234567891023456789.5", )], ["numb"])
    
    df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
    df.show(truncate=False)
    #+----------------------+----------------------+
    #|numb                  |numb_dec              |
    #+----------------------+----------------------+
    #|10234567891023456789.5|10234567891023456789.5|
    #+----------------------+----------------------+
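
    If you want a true decimal column from the start (skipping the extra cast), you can also pass Python decimal.Decimal objects together with an explicit schema. A minimal sketch, assuming an active SparkSession named spark as in the question; Decimal parses the string exactly, so no precision is lost:

    from decimal import Decimal

    # A float literal is rounded to ~15-17 significant digits,
    # while Decimal keeps every digit:
    print(10234567891023456789.5)             # 1.0234567891023456e+19
    print(Decimal("10234567891023456789.5"))  # 10234567891023456789.5

    df = spark.createDataFrame(
        [(Decimal("10234567891023456789.5"), )],
        "numb decimal(30,1)",  # the schema string that failed with a float
    )
    df.show(truncate=False)
    #+----------------------+
    #|numb                  |
    #+----------------------+
    #|10234567891023456789.5|
    #+----------------------+

    Both approaches work for the same reason: the value never passes through a float, which is where the digits are lost.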