string apache-spark pyspark azure-synapse

Quickest/nicest way to add leading zeroes between indefinite number of dots

We have a string which is consistently formatted: "1.20.3 - version name"

We need this formatted as: "01.20.03 - version name"

Occasionally, we will get "some tag - version name", in which case, we'll leave it alone.

We've separated the version number from version name using

df = df.withColumn('number', split(col('fullname'), ' ').getItem(0))

and know that we can use .when(df.fullname.substr(0,1).isDigit(), some_function) and .otherwise(df.fullname)) to only apply this to the rows we want.

But the best solution we've figured from here is to further split(col('fullname'), '.'), knowing that we will consistently get 3 new columns, then lpad each to the length of 2 and concatenate back.

I feel that there is probably a nicer solution here which is more generalized, not hardcoded for the 3 columns. Looking for any advice to make this better or if there's a general solution here which won't break if we moved to something like 10.20.30.40 in the future.

Solution

You can use the below generalized solution to add leading zeros in the string for any number of components in version number separated by . dot.

Step:1 Sample input data frame is taken.

from pyspark.sql.functions import split, when, col, concat_ws, lpad,size,max

spark = SparkSession.builder.appName("format_version_number").getOrCreate()

#sample input dataframe
data = [("1.20.3 - version name",), ("some tag - version name",), ("10.20.30.4 - version name",)]
df = spark.createDataFrame(data, ["fullname"])

fullname
1.20.3 - version name
some tag - version name
1.20.3.4 - version name

Step:2 The fullname column is split into version number and name based on the character -.

df = df.withColumn('number', split(col('fullname'), ' - ').getItem(0))
df.show()

fullname	number
1.20.3 - version name	1.20.3
some tag - version name	some tag
1.20.3.4 - version name	1.20.3.4

Step:3 First character of the number column is checked if it has a number.

df=df.withColumn('is_digit',col('number').substr(0,1).cast("int").isNotNull())
df.show()

fullname	number	is_digit
1.20.3 - version name	1.20.3	TRUE
some tag - version name	some tag	FALSE
1.20.3.4 - version name	1.20.3.4	TRUE

Step:4 If the first character of the number field is a numeric value, then number field is converted to an array of numbers. This is done by splitting the number column based on . character.

df = df.withColumn('number_split', when(df.is_digit, split(col('number'), '\.')).otherwise(None))
df.show()

fullname	number	is_digit	number_split
1.20.3 - version name	1.20.3	TRUE	[1,20,3]
some tag - version name	some tag	FALSE	null
1.20.3.4 - version name	1.20.3.4	TRUE	[1,20,3,4]

Step:5 Each value in the array is padded with zeroes and then concatenated all the values in the array back with dots .

max_size = df.agg(max(size(col('number_split')))).collect()[0][0]
df = df.withColumn('number_padded', when(df.number_split.isNotNull(), concat_ws('.', *[lpad(col('number_split')[i], 2, '0') for i in  range(0,max_size)])))
df.show()

fullname	number	is_digit	number_split	number_padded
1.20.3 - version name	1.20.3	TRUE	[1,20,3]	01.20.03
some tag - version name	some tag	FALSE	null	null
1.20.3.4 - version name	1.20.3.4	TRUE	[1,20,3,4]	01.20.03.04

Step:6 number_padded field is joined with the version name of full_name field.

df = df.withColumn('formatted_fullname', when(df.is_digit, concat_ws(' - ', df.number_padded, split(col('fullname'), ' - ').getItem(1))).otherwise(df.fullname))
df.show()

fullname	number	is_digit	number_split	number_padded	formatted_fullname
1.20.3 - version name	1.20.3	TRUE	[1,20,3]	01.20.03	01.20.03 - version name
some tag - version name	some tag	FALSE	null	null	some tag - version name
1.20.3.4 - version name	1.20.3.4	TRUE	[1,20,3,4]	01.20.03.04	01.20.03.04 - version name

Formatted_fullname field has the zero padded value.

Full code:

from pyspark.sql.functions import split, when, col, concat_ws, lpad,size,max
spark = SparkSession.builder.appName("format_version_number").getOrCreate()

#sample input dataframe
data = [("1.20.3 - version name",), ("some tag - version name",), ("10.20.30.4 - version name",)]
df = spark.createDataFrame(data, ["fullname"])

#Split the fullname column into number and name columns
df = df.withColumn('number', split(col('fullname'), ' - ').getItem(0))
df.show()

#Check if the first character of the number column is a digit
df=df.withColumn('is_digit',col('number').substr(0,1).cast("int").isNotNull())

# If the first character of the number column is a digit, split the number column into separate columns(arrays)
df = df.withColumn('number_split', when(df.is_digit, split(col('number'), '\.')).otherwise(None))

#If the number column was split into separate columns, pad each column with leading zeros and concatenate them back
max_size = df.agg(max(size(col('number_split')))).collect()[0][0]
df = df.withColumn('number_padded', when(df.number_split.isNotNull(), concat_ws('.', *[lpad(col('number_split')[i], 2, '0') for i in  range(0,max_size)])))

#Concatenate the number and name columns back into the formatted_ fullname column
df = df.withColumn('formatted_fullname', when(df.is_digit, concat_ws(' - ', df.number_padded, split(col('fullname'), ' - ').getItem(1))).otherwise(df.fullname))
df.show()

enter image description here