apache-spark, pyspark, dynamic, apache-spark-sql, multiple-columns

Dynamically select the columns in a Spark dataframe


I have a dataframe with the columns "id" and "type", plus year columns that come in pairs: "2019" and "2019_p", "2020" and "2020_p", "2021" and "2021_p".
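
For reference, a minimal sketch of such a dataframe (the specific values, and which cells are null, are made-up sample data chosen so the result matches the expected output described below):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # hypothetical sample rows; nulls are placed so that coalescing each
    # year column with its "_p" counterpart fills in every value
    data_sdf = spark.createDataFrame(
        [
            (1, 'A', None, 50, 65, None, None, 40),
            (1, 'B', 25, None, None, 75, 75, None),
        ],
        ['id', 'type', '2019', '2019_p', '2020', '2020_p', '2021', '2021_p']
    )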

I want to select the final columns dynamically: if "2019" is null, take the value of "2019_p"; if "2020" is null, take the value of "2020_p"; and the same applies to "2021", and so on.

I want to select the columns dynamically without hardcoding the column names. How do I achieve this?

I need the output to contain only the columns "id", "type", "2019", "2020" and "2021", with each year column's nulls filled from its corresponding "_p" column.


Solution

  • You can simplify ZygD's approach by using a list comprehension with coalesce (no regex needed).

    from pyspark.sql import functions as func

    # following list can be created from a source dataframe as well
    year_cols = ['2019', '2020', '2021']
    # [k for k in data_sdf.columns if k.startswith('20') and not k.endswith('_p')]
    
    data_sdf. \
        select('id', 'type',
               *[func.coalesce(c, c+'_p').alias(c) for c in year_cols]
               ). \
        show()
    
    # +---+----+----+----+----+
    # | id|type|2019|2020|2021|
    # +---+----+----+----+----+
    # |  1|   A|  50|  65|  40|
    # |  1|   B|  25|  75|  75|
    # +---+----+----+----+----+
    

    where the list comprehension would yield the following

    [func.coalesce(c, c+'_p').alias(c) for c in year_cols]
    
    # [Column<'coalesce(2019, 2019_p) AS `2019`'>,
    #  Column<'coalesce(2020, 2020_p) AS `2020`'>,
    #  Column<'coalesce(2021, 2021_p) AS `2021`'>]
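
    If you want to avoid hardcoding even the year_cols list, the base/"_p" pairs can be derived from the dataframe's schema. A minimal sketch (using the same data_sdf and func as above; base_cols and other_cols are just helper names), assuming every year column has a matching "_p" column:

    # year columns are those that have a "_p" counterpart in the schema
    base_cols = [c for c in data_sdf.columns if c + '_p' in data_sdf.columns]
    # keep the remaining non-"_p" columns (e.g. id, type) as-is
    other_cols = [c for c in data_sdf.columns if c not in base_cols and not c.endswith('_p')]

    data_sdf. \
        select(*other_cols,
               *[func.coalesce(c, c + '_p').alias(c) for c in base_cols]
               ). \
        show()

    # gives the same result as the show() output above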