I have a list called segments
having list of column names. Each column value also contains column name. for ex: genre_list
is a column, which can have value genre_list_comedy
/ genre_list_drama_val_us
etc.
I need to iterate list of columns in segments
, for the value in that column, split column name name and value. ie. split the value based on the column name and then keep the column name as one part and the rest as another part. Underscore is separator.
Only split the value based on the column name and then keep the column name as one part and the rest as another part.
Input: sample data with one column. But in general it will have multiple columns , also listed in segments
list.
+-----------------------
| genre_list |
+-----------------------
| genre_list_comedy_val|
| genre_list_drama_val_us |
| genre_list_action_val|
+-----------------------
Output: need something like below
+---------------------+------------------+
| genre_list_column_name | genre_list_value |
+---------------------+------------------+
| genre_list | comedy_val |
| genre_list | drama_val_us |
| genre_list | action_val |
+---------------------+------------------+
I have below code and planning to use split
. But split functions also splits _ in column name. I'm not sure how I can split column name as key and value as rest of the value.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split
data = [("genre_list_comedy_val",),
("genre_list_drama_val_us",),
("genre_list_action_val",)]
# Create DataFrame
df = spark.createDataFrame(data, ["genre_list"])
df.show()
You can iterate through the columns names you want to split, then split by the column name + "_"
I have extended your data set to show an example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, split
spark = SparkSession.builder.appName("Test").getOrCreate()
data = [("genre_list_comedy_val","test_list_abc"),
("genre_list_drama_val_us","test_list_def"),
("genre_list_action_val","test_list_gh")]
# Create DataFrame
df = spark.createDataFrame(data, ["genre_list","test_list"])
df.show(truncate=False)
+-----------------------+-------------+
|genre_list |test_list |
+-----------------------+-------------+
|genre_list_comedy_val |test_list_abc|
|genre_list_drama_val_us|test_list_def|
|genre_list_action_val |test_list_gh |
+-----------------------+-------------+
Then apply the column name split logic:
split_cols = ['genre_list','test_list']
for split_col in split_cols:
split_data = split(df[split_col], f'{split_col}_')
df = df.withColumns({
f"{split_col}_column_name": lit(split_col),
f"{split_col}_value": split_data.getItem(1)
}).drop(split_col)
+----------------------+----------------+---------------------+---------------+
|genre_list_column_name|genre_list_value|test_list_column_name|test_list_value|
+----------------------+----------------+---------------------+---------------+
| genre_list| comedy_val| test_list| abc|
| genre_list| drama_val_us| test_list| def|
| genre_list| action_val| test_list| gh|
+----------------------+----------------+---------------------+---------------+