table1:
rf_1 | rf_2 |
---|---|
['abc','a'] | ['abce','cde'] |
['abc','cde'] | ['1','2'] |
How can I split all columns starting with rf into individual columns?
Final output should look like this:
rf_1a | rf_1b | rf_2a | rf_2b |
---|---|---|---|
'abc' | 'a' | 'abce' | 'cde' |
'abc' | 'cde' | 1 | 2 |
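For reference, since each cell is a valid Python list literal such as "['abc','a']", the per-cell parsing can be sketched in plain Python with ast.literal_eval instead of manual bracket stripping (parse_cell is a hypothetical helper name, not part of any Spark API):

```python
import ast

def parse_cell(cell):
    # Each cell is a valid Python list literal such as "['abc','a']",
    # so ast.literal_eval parses it into a real list safely.
    return ast.literal_eval(cell)

print(parse_cell("['abce','cde']"))  # ['abce', 'cde']
```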
I have tried the below example in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

dilip_sample_data = [("['abc','a']", "['abce','cde']"),
                     ("['abc','cde']", "['1','2']")]
initial_df = spark.createDataFrame(dilip_sample_data, ["rf_1", "rf_2"])

def extract_elements(s):
    # Strip the surrounding brackets and quotes, then split into items.
    items = s.strip("[]").replace("'", "").split(",")
    result = []
    for x in items:
        x = x.strip()
        if x.isdigit():
            # Keep digits unquoted; cast back to str to match the
            # declared ArrayType(StringType()) return type.
            result.append(str(int(x)))
        else:
            result.append(f"'{x}'")
    return result

extract_elements_udf = udf(extract_elements, ArrayType(StringType()))

dilip_final_df = initial_df.withColumn("rf_1a", extract_elements_udf("rf_1")[0]) \
    .withColumn("rf_1b", extract_elements_udf("rf_1")[1]) \
    .withColumn("rf_2a", extract_elements_udf("rf_2")[0]) \
    .withColumn("rf_2b", extract_elements_udf("rf_2")[1])

dilip_final_df.select("rf_1a", "rf_1b", "rf_2a", "rf_2b").show(truncate=False)
+-----+-----+------+-----+
|rf_1a|rf_1b|rf_2a |rf_2b|
+-----+-----+------+-----+
|'abc'|'a' |'abce'|'cde'|
|'abc'|'cde'|1 |2 |
+-----+-----+------+-----+
In the code above I am defining sample data and creating an initial DataFrame, defining a function to extract elements from a string, creating a user-defined function (UDF) from the extract_elements function, and applying the UDF to the DataFrame to create new columns.
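Since the goal is to split every column starting with rf_, the new column names could also be derived dynamically rather than hard-coded. A minimal sketch of that naming step in plain Python (the expanded_names helper and the a/b suffix scheme are my own assumptions, not from any Spark API):

```python
def expanded_names(cols, prefix="rf_", suffixes=("a", "b")):
    # For each column whose name starts with the prefix, emit one new
    # name per array slot, e.g. rf_1 -> rf_1a, rf_1b.
    return [c + s for c in cols if c.startswith(prefix) for s in suffixes]

print(expanded_names(["rf_1", "rf_2", "other"]))  # ['rf_1a', 'rf_1b', 'rf_2a', 'rf_2b']
```

These names could then drive a loop of withColumn calls instead of writing each one out by hand.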