We have a string which is consistently formatted: "1.20.3 - version name"
We need this formatted as: "01.20.03 - version name"
Occasionally, we will get "some tag - version name", in which case, we'll leave it alone.
We've separated the version number from version name using
df = df.withColumn('number', split(col('fullname'), ' ').getItem(0))
and know that we can use .when(df.fullname.substr(0,1).isDigit(), some_function)
and .otherwise(df.fullname))
to only apply this to the rows we want.
But the best solution we've figured from here is to further split(col('fullname'), '.')
, knowing that we will consistently get 3 new columns, then lpad
each to the length of 2 and concatenate back.
I feel that there is probably a nicer solution here which is more generalized, not hardcoded for the 3 columns. Looking for any advice to make this better or if there's a general solution here which won't break if we moved to something like 10.20.30.40 in the future.
You can use the below generalized solution to add leading zeros in the string for any number of components in version number separated by .
dot.
Step:1 Sample input data frame is taken.
from pyspark.sql.functions import split, when, col, concat_ws, lpad,size,max
spark = SparkSession.builder.appName("format_version_number").getOrCreate()
#sample input dataframe
data = [("1.20.3 - version name",), ("some tag - version name",), ("10.20.30.4 - version name",)]
df = spark.createDataFrame(data, ["fullname"])
fullname |
---|
1.20.3 - version name |
some tag - version name |
1.20.3.4 - version name |
Step:2 The fullname
column is split into version number and name based on the character -
.
df = df.withColumn('number', split(col('fullname'), ' - ').getItem(0))
df.show()
fullname | number |
---|---|
1.20.3 - version name | 1.20.3 |
some tag - version name | some tag |
1.20.3.4 - version name | 1.20.3.4 |
Step:3 First character of the number column is checked if it has a number.
df=df.withColumn('is_digit',col('number').substr(0,1).cast("int").isNotNull())
df.show()
fullname | number | is_digit |
---|---|---|
1.20.3 - version name | 1.20.3 | TRUE |
some tag - version name | some tag | FALSE |
1.20.3.4 - version name | 1.20.3.4 | TRUE |
Step:4 If the first character of the number field is a numeric value, then number field is converted to an array of numbers. This is done by splitting the number column based on .
character.
df = df.withColumn('number_split', when(df.is_digit, split(col('number'), '\.')).otherwise(None))
df.show()
fullname | number | is_digit | number_split |
---|---|---|---|
1.20.3 - version name | 1.20.3 | TRUE | [1,20,3] |
some tag - version name | some tag | FALSE | null |
1.20.3.4 - version name | 1.20.3.4 | TRUE | [1,20,3,4] |
Step:5 Each value in the array is padded with zeroes and then concatenated all the values in the array back with dots .
max_size = df.agg(max(size(col('number_split')))).collect()[0][0]
df = df.withColumn('number_padded', when(df.number_split.isNotNull(), concat_ws('.', *[lpad(col('number_split')[i], 2, '0') for i in range(0,max_size)])))
df.show()
fullname | number | is_digit | number_split | number_padded |
---|---|---|---|---|
1.20.3 - version name | 1.20.3 | TRUE | [1,20,3] | 01.20.03 |
some tag - version name | some tag | FALSE | null | null |
1.20.3.4 - version name | 1.20.3.4 | TRUE | [1,20,3,4] | 01.20.03.04 |
Step:6 number_padded field is joined with the version name of full_name field.
df = df.withColumn('formatted_fullname', when(df.is_digit, concat_ws(' - ', df.number_padded, split(col('fullname'), ' - ').getItem(1))).otherwise(df.fullname))
df.show()
fullname | number | is_digit | number_split | number_padded | formatted_fullname |
---|---|---|---|---|---|
1.20.3 - version name | 1.20.3 | TRUE | [1,20,3] | 01.20.03 | 01.20.03 - version name |
some tag - version name | some tag | FALSE | null | null | some tag - version name |
1.20.3.4 - version name | 1.20.3.4 | TRUE | [1,20,3,4] | 01.20.03.04 | 01.20.03.04 - version name |
Formatted_fullname
field has the zero padded value.
Full code:
from pyspark.sql.functions import split, when, col, concat_ws, lpad,size,max
spark = SparkSession.builder.appName("format_version_number").getOrCreate()
#sample input dataframe
data = [("1.20.3 - version name",), ("some tag - version name",), ("10.20.30.4 - version name",)]
df = spark.createDataFrame(data, ["fullname"])
#Split the fullname column into number and name columns
df = df.withColumn('number', split(col('fullname'), ' - ').getItem(0))
df.show()
#Check if the first character of the number column is a digit
df=df.withColumn('is_digit',col('number').substr(0,1).cast("int").isNotNull())
# If the first character of the number column is a digit, split the number column into separate columns(arrays)
df = df.withColumn('number_split', when(df.is_digit, split(col('number'), '\.')).otherwise(None))
#If the number column was split into separate columns, pad each column with leading zeros and concatenate them back
max_size = df.agg(max(size(col('number_split')))).collect()[0][0]
df = df.withColumn('number_padded', when(df.number_split.isNotNull(), concat_ws('.', *[lpad(col('number_split')[i], 2, '0') for i in range(0,max_size)])))
#Concatenate the number and name columns back into the formatted_ fullname column
df = df.withColumn('formatted_fullname', when(df.is_digit, concat_ws(' - ', df.number_padded, split(col('fullname'), ' - ').getItem(1))).otherwise(df.fullname))
df.show()