I need to cast numbers from a column with StringType to a DecimalType. The Decimal type should have a predefined precision and scale, for example Decimal(2,1), but the numbers in the String column may not fit this precision and scale.
Here is an example:
from pyspark.sql.types import DecimalType
import pyspark.sql.functions as F
df = spark.createDataFrame([("-001.399", )],["String"]).withColumn("Decimal", F.col("String").cast(DecimalType(2,1)))
df.show(truncate=False)
+--------+-------+
|String  |Decimal|
+--------+-------+
|-001.399|-1.4   |
+--------+-------+
The result shows that the number is rounded when it is cast to Decimal, which is not the desired behavior in my use case.
1. Is it possible to cast String to Decimal without rounding?
The expected result would be:
+--------+-------+
|String  |Decimal|
+--------+-------+
|-001.399|-1.3   |
+--------+-------+
2. Is it possible to "strictly" cast String to Decimal so that it returns null if the actual scale does not fit the specified scale exactly?
The expected result would be (the same as when trying to cast "-0011.399" to Decimal(2,1)):
+--------+-------+
|String  |Decimal|
+--------+-------+
|-001.399|null   |
+--------+-------+
The following regex keeps only 1 digit after the decimal point, so the value is truncated instead of rounded. You could apply this transformation before casting to the decimal type.
F.regexp_extract("String", r"^([^.]*(\.\d)?)", 1)
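To see what each part of the pattern does, here is a small local check with Python's `re` module. This is only a sketch for reading the regex: Spark evaluates it with Java's regex engine, and the helper name `keep_one_fraction_digit` is made up for this illustration. The pattern uses only constructs that behave the same way in both engines for plain ASCII input.

```python
import re

# Same pattern as in the regexp_extract call above.
PATTERN = r"^([^.]*(\.\d)?)"


def keep_one_fraction_digit(s: str) -> str:
    """Hypothetical helper: return capture group 1 of PATTERN for s."""
    # [^.]*    everything before the first "." (sign and integer digits)
    # (\.\d)?  optionally the "." together with exactly one digit
    # The pattern can match an empty prefix, so re.match never returns None.
    return re.match(PATTERN, s).group(1)


for s in ["-001.399", "-001.", ".39"]:
    print(s, "->", keep_one_fraction_digit(s))
```

Note that a trailing "." without a digit is dropped, because the optional group needs a digit after the dot.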
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([("-001.399",), ("001.399",), ("-001",), ("001",), ("-001.",), ("001.",), ("-001.39",), ("-001.3",), ("-.39",), (".39",)], ["String"])
df = df.withColumn("String2", F.regexp_extract("String", r"^([^.]*(\.\d)?)", 1))
df = df.withColumn("Decimal", F.col("String2").cast('decimal(2,1)'))
df.show()
# +--------+-------+-------+
# |  String|String2|Decimal|
# +--------+-------+-------+
# |-001.399| -001.3|   -1.3|
# | 001.399|  001.3|    1.3|
# |    -001|   -001|   -1.0|
# |     001|    001|    1.0|
# |   -001.|   -001|   -1.0|
# |    001.|    001|    1.0|
# | -001.39| -001.3|   -1.3|
# |  -001.3| -001.3|   -1.3|
# |    -.39|    -.3|   -0.3|
# |     .39|     .3|    0.3|
# +--------+-------+-------+
If you need more digits after the decimal point, you can insert e.g. {0,3} into the regex (for up to 3 digits after the decimal point). The scale of the target decimal type has to be raised to match, otherwise the cast rounds again.
r"^([^.]*(\.\d{0,3})?)"
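For question 2, one possible approach is to test the string against a regex first and cast only when it has no more fractional digits than the scale allows. This is a sketch under assumptions, not a definitive implementation: the helper names `strict_pattern` and `strict_cast` are hypothetical, "strict" is read as "at most `scale` digits after the decimal point" (values with fewer digits, such as "001", are still accepted), and getting null for values that exceed the precision relies on ANSI mode being off (`spark.sql.ansi.enabled` set to false).

```python
import re


def strict_pattern(scale: int) -> str:
    """Hypothetical helper: regex matching an optional sign, integer
    digits, and at most `scale` digits after the decimal point."""
    return r"^[+-]?[0-9]*(\.[0-9]{0," + str(scale) + r"})?$"


def strict_cast(column_name: str, precision: int, scale: int):
    """Hypothetical helper: Column that is null unless the string has
    at most `scale` fractional digits, otherwise the regular cast."""
    # Imported lazily so the pattern helper can be checked without Spark.
    from pyspark.sql import functions as F

    col = F.col(column_name)
    # F.when without .otherwise() yields null for non-matching rows.
    return F.when(
        col.rlike(strict_pattern(scale)),
        col.cast(f"decimal({precision},{scale})"),
    )


# Local check of the pattern only (no Spark session needed).
print(bool(re.search(strict_pattern(1), "-001.3")))    # accepted
print(bool(re.search(strict_pattern(1), "-001.399")))  # rejected
```

It could be used as `df.withColumn("Decimal", strict_cast("String", 2, 1))`. With that, "-001.399" gives null because it has three fractional digits, while "-001.3" is cast as usual.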