Search code examples
stringapache-sparkpysparkazure-synapse

Quickest/nicest way to add leading zeroes between indefinite number of dots


We have a string which is consistently formatted: "1.20.3 - version name"

We need this formatted as: "01.20.03 - version name"

Occasionally, we will get "some tag - version name", in which case, we'll leave it alone.

We've separated the version number from version name using

df = df.withColumn('number', split(col('fullname'), ' ').getItem(0))

and know that we can use .when(df.fullname.substr(0,1).isDigit(), some_function) and .otherwise(df.fullname)) to only apply this to the rows we want.

But the best solution we've figured from here is to further split(col('fullname'), '.'), knowing that we will consistently get 3 new columns, then lpad each to the length of 2 and concatenate back.

I feel that there is probably a nicer solution here which is more generalized, not hardcoded for the 3 columns. Looking for any advice to make this better or if there's a general solution here which won't break if we moved to something like 10.20.30.40 in the future.


Solution

  • You can use the below generalized solution to add leading zeros in the string for any number of components in version number separated by . dot.

    Step:1 Sample input data frame is taken.

    from pyspark.sql.functions import split, when, col, concat_ws, lpad,size,max
    
    spark = SparkSession.builder.appName("format_version_number").getOrCreate()
    
    #sample input dataframe
    data = [("1.20.3 - version name",), ("some tag - version name",), ("10.20.30.4 - version name",)]
    df = spark.createDataFrame(data, ["fullname"])
    
    fullname
    1.20.3 - version name
    some tag - version name
    1.20.3.4 - version name

    Step:2 The fullname column is split into version number and name based on the character -.

    df = df.withColumn('number', split(col('fullname'), ' - ').getItem(0))
    df.show()
    
    fullname number
    1.20.3 - version name 1.20.3
    some tag - version name some tag
    1.20.3.4 - version name 1.20.3.4

    Step:3 First character of the number column is checked if it has a number.

    df=df.withColumn('is_digit',col('number').substr(0,1).cast("int").isNotNull())
    df.show()
    
    fullname number is_digit
    1.20.3 - version name 1.20.3 TRUE
    some tag - version name some tag FALSE
    1.20.3.4 - version name 1.20.3.4 TRUE

    Step:4 If the first character of the number field is a numeric value, then number field is converted to an array of numbers. This is done by splitting the number column based on . character.

    df = df.withColumn('number_split', when(df.is_digit, split(col('number'), '\.')).otherwise(None))
    df.show()
    
    fullname number is_digit number_split
    1.20.3 - version name 1.20.3 TRUE [1,20,3]
    some tag - version name some tag FALSE null
    1.20.3.4 - version name 1.20.3.4 TRUE [1,20,3,4]

    Step:5 Each value in the array is padded with zeroes and then concatenated all the values in the array back with dots .

    max_size = df.agg(max(size(col('number_split')))).collect()[0][0]
    df = df.withColumn('number_padded', when(df.number_split.isNotNull(), concat_ws('.', *[lpad(col('number_split')[i], 2, '0') for i in  range(0,max_size)])))
    df.show()
    
    fullname number is_digit number_split number_padded
    1.20.3 - version name 1.20.3 TRUE [1,20,3] 01.20.03
    some tag - version name some tag FALSE null null
    1.20.3.4 - version name 1.20.3.4 TRUE [1,20,3,4] 01.20.03.04

    Step:6 number_padded field is joined with the version name of full_name field.

    df = df.withColumn('formatted_fullname', when(df.is_digit, concat_ws(' - ', df.number_padded, split(col('fullname'), ' - ').getItem(1))).otherwise(df.fullname))
    df.show()
    
    fullname number is_digit number_split number_padded formatted_fullname
    1.20.3 - version name 1.20.3 TRUE [1,20,3] 01.20.03 01.20.03 - version name
    some tag - version name some tag FALSE null null some tag - version name
    1.20.3.4 - version name 1.20.3.4 TRUE [1,20,3,4] 01.20.03.04 01.20.03.04 - version name

    Formatted_fullname field has the zero padded value.

    Full code:

    from pyspark.sql.functions import split, when, col, concat_ws, lpad,size,max
    spark = SparkSession.builder.appName("format_version_number").getOrCreate()
    
    #sample input dataframe
    data = [("1.20.3 - version name",), ("some tag - version name",), ("10.20.30.4 - version name",)]
    df = spark.createDataFrame(data, ["fullname"])
    
    #Split the fullname column into number and name columns
    df = df.withColumn('number', split(col('fullname'), ' - ').getItem(0))
    df.show()
    
    #Check if the first character of the number column is a digit
    df=df.withColumn('is_digit',col('number').substr(0,1).cast("int").isNotNull())
    
    # If the first character of the number column is a digit, split the number column into separate columns(arrays)
    df = df.withColumn('number_split', when(df.is_digit, split(col('number'), '\.')).otherwise(None))
    
    #If the number column was split into separate columns, pad each column with leading zeros and concatenate them back
    max_size = df.agg(max(size(col('number_split')))).collect()[0][0]
    df = df.withColumn('number_padded', when(df.number_split.isNotNull(), concat_ws('.', *[lpad(col('number_split')[i], 2, '0') for i in  range(0,max_size)])))
    
    #Concatenate the number and name columns back into the formatted_ fullname column
    df = df.withColumn('formatted_fullname', when(df.is_digit, concat_ws(' - ', df.number_padded, split(col('fullname'), ' - ').getItem(1))).otherwise(df.fullname))
    df.show()
    

    enter image description here