Tags: regex, string, pyspark, numeric

Pyspark handling number with comma to decimal


I have a column, ent_Rentabiliteit_ent_rentabiliteit, that needs to be transformed into a decimal. It is currently a string, and I need to convert it to a data type that preserves the same values, so after the transformation values such as -0.7 or -1.2 must still be shown.

My current code looks like this:

from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

df2test = df2.withColumn(
    "ent_Rentabiliteit_ent_rentabiliteit",
    # Replace the decimal comma with a period, then cast to decimal
    F.regexp_replace("ent_Rentabiliteit_ent_rentabiliteit", ",", ".").cast(
        DecimalType()
    ),
)

Any suggestions? Many thanks!


Solution

  • You could just cast the column to DecimalType.

    Example:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DecimalType

    spark = SparkSession.builder.getOrCreate()

    data = [
        {
            "ent_Rentabiliteit_ent_rentabiliteit": "1.23",
        },
        {
            "ent_Rentabiliteit_ent_rentabiliteit": "-2.33",
        },
    ]
    df = spark.createDataFrame(data)
    df = df.withColumn(
        "new_col", F.col("ent_Rentabiliteit_ent_rentabiliteit").cast(DecimalType(10, 4))
    )
    df.show()
    df.printSchema()
    

    Result:

    +-----------------------------------+-------+                                   
    |ent_Rentabiliteit_ent_rentabiliteit|new_col|
    +-----------------------------------+-------+
    |                               1.23| 1.2300|
    |                              -2.33|-2.3300|
    +-----------------------------------+-------+
    

    Schema:

    root
     |-- ent_Rentabiliteit_ent_rentabiliteit: string (nullable = true)
     |-- new_col: decimal(10,4) (nullable = true)
    

    Please keep in mind that DecimalType has a fixed precision. When creating a DecimalType, the default precision and scale are (10, 0). When inferring a schema from decimal.Decimal objects, it will be DecimalType(38, 18).
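
    For instance, casting the same strings with the default DecimalType() drops the fractional part, which is presumably what went wrong with the cast in the question; a minimal sketch reusing the df from above:

    # decimal(10, 0) has scale 0, so "1.23" becomes 1 and "-2.33" becomes -2
    df_default = df.withColumn(
        "truncated",
        F.col("ent_Rentabiliteit_ent_rentabiliteit").cast(DecimalType()),
    )
    df_default.show()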

    Please read more about precision here: DecimalType
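
    Note that the question title mentions numbers with a comma. If the raw strings actually use a decimal comma, you could replace it with a period before casting; a minimal sketch assuming such input (the sample value "-0,7" is illustrative):

    # Assumption: the source strings use a decimal comma, e.g. "-0,7"
    df_comma = spark.createDataFrame(
        [{"ent_Rentabiliteit_ent_rentabiliteit": "-0,7"}]
    )
    df_comma = df_comma.withColumn(
        "new_col",
        # Swap the comma for a period so the decimal cast can parse it
        F.regexp_replace("ent_Rentabiliteit_ent_rentabiliteit", ",", ".").cast(
            DecimalType(10, 4)
        ),
    )
    df_comma.show()  # "-0,7" becomes -0.7000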