
Split value - column name and rest of the value


I have a list called segments that holds column names. Each column's value also starts with the column name. For example, genre_list is a column whose values can be genre_list_comedy, genre_list_drama_val_us, etc.

I need to iterate over the columns listed in segments and, for each value in a column, split it into the column name and the remainder, i.e. split the value on the column name and keep the column name as one part and the rest as the other part. Underscore is the separator.

  • the column name itself contains underscores; don't split on underscores inside the column name
  • the value can contain multiple underscores; don't split on underscores inside the rest of the value

Only split the value on the column name, keeping the column name as one part and the rest as another part.
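In plain Python terms, the rule amounts to stripping the column name plus one underscore from the front of the value and keeping everything after it. A minimal sketch of that rule (the helper name is mine, for illustration only):

```python
# Sketch of the splitting rule: strip "<column name>_" from the front
# of the value; underscores inside the name and the rest are untouched.
def split_by_column_name(column_name, value):
    prefix = column_name + "_"
    if value.startswith(prefix):
        return column_name, value[len(prefix):]
    return column_name, value  # value did not carry the expected prefix

print(split_by_column_name("genre_list", "genre_list_drama_val_us"))
# -> ('genre_list', 'drama_val_us')
```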

Input: sample data with one column. In general it will have multiple columns, all listed in the segments list.

+-------------------------+
| genre_list              |
+-------------------------+
| genre_list_comedy_val   |
| genre_list_drama_val_us |
| genre_list_action_val   |
+-------------------------+

Output: need something like below

+------------------------+------------------+
| genre_list_column_name | genre_list_value |
+------------------------+------------------+
| genre_list             | comedy_val       |
| genre_list             | drama_val_us     |
| genre_list             | action_val       |
+------------------------+------------------+

I have the code below and plan to use split. But split also splits on the _ inside the column name. I'm not sure how to split so that the column name becomes one part and the rest of the value the other.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("Test").getOrCreate()

data = [("genre_list_comedy_val",),
        ("genre_list_drama_val_us",),
        ("genre_list_action_val",)]

# Create DataFrame
df = spark.createDataFrame(data, ["genre_list"])
df.show()

Solution

  • You can iterate through the column names you want to split, then split on the column name + "_".

    I have extended your data set to show an example:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit, col, split
    
    spark = SparkSession.builder.appName("Test").getOrCreate()
    
    data = [("genre_list_comedy_val","test_list_abc"),
            ("genre_list_drama_val_us","test_list_def"),
            ("genre_list_action_val","test_list_gh")]
    
    # Create DataFrame
    df = spark.createDataFrame(data, ["genre_list","test_list"])
    df.show(truncate=False)
    
    +-----------------------+-------------+
    |genre_list             |test_list    |
    +-----------------------+-------------+
    |genre_list_comedy_val  |test_list_abc|
    |genre_list_drama_val_us|test_list_def|
    |genre_list_action_val  |test_list_gh |
    +-----------------------+-------------+
    

    Then apply the column name split logic:

    split_cols = ['genre_list', 'test_list']
    
    for split_col in split_cols:
        # Split on "<column name>_": item 0 is the empty prefix,
        # item 1 is everything after the column name.
        split_data = split(df[split_col], f'{split_col}_')
        df = df.withColumns({  # withColumns requires Spark 3.3+
            f"{split_col}_column_name": lit(split_col),
            f"{split_col}_value": split_data.getItem(1)
        }).drop(split_col)
    
    df.show()
    
    +----------------------+----------------+---------------------+---------------+
    |genre_list_column_name|genre_list_value|test_list_column_name|test_list_value|
    +----------------------+----------------+---------------------+---------------+
    |            genre_list|      comedy_val|            test_list|            abc|
    |            genre_list|    drama_val_us|            test_list|            def|
    |            genre_list|      action_val|            test_list|             gh|
    +----------------------+----------------+---------------------+---------------+
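    An alternative sketch that avoids split() entirely: since every value starts with the column name plus "_", you can drop that prefix with substring arithmetic. This sidesteps the fact that split() treats its pattern as a regex, and the chained withColumn calls also work on Spark versions before 3.3 (which withColumns requires). Assumes every value really does carry the "<column name>_" prefix:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lit

    spark = SparkSession.builder.appName("Test").getOrCreate()

    df = spark.createDataFrame(
        [("genre_list_comedy_val",),
         ("genre_list_drama_val_us",),
         ("genre_list_action_val",)],
        ["genre_list"],
    )

    for split_col in ["genre_list"]:
        prefix_len = len(split_col) + 1  # column name plus the "_" separator
        df = (
            df.withColumn(f"{split_col}_column_name", lit(split_col))
              # substr is 1-based: start right after the "<name>_" prefix
              .withColumn(f"{split_col}_value",
                          col(split_col).substr(prefix_len + 1, 1000))
              .drop(split_col)
        )

    df.show(truncate=False)
    ```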