
PySpark dynamic column selection from a dataframe


I have a dataframe with multiple columns, such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with many columns). A second dataframe contains some of the column names from the first dataframe, e.g.:

columnname
t_pono
t_pric
:
:

I need to select only those columns from the first dataframe whose names are present in the second one. In the above example: t_pono and t_pric.

How can this be done?


Solution

  • Let's say you have the following columns (they can be obtained with df.columns, which returns a list):

    df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
    df2_cols = ["columnname", "t_pono", "t_pric"]
    

    To get only those columns from the first dataframe that are also present in the second one, you can take the set intersection (I convert it to a list so it can be used to select the data):

    list(set(df1_cols).intersection(df2_cols))
    

    And we get the result (note that sets are unordered, so the order of the names may vary):

    ["t_pono", "t_pric"]
    

    To put it all together and select only those columns:

    select_columns = list(set(df1_cols).intersection(df2_cols))
    
    new_df = df1.select(*select_columns)
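
    Note that in the question the second dataframe holds the wanted column names as row values (its only column header is columnname), so in practice you would collect those values rather than read df2.columns. A minimal sketch, assuming the second dataframe is bound to the name df2:

    # Collect the values of the "columnname" column into a Python list
    df2_cols = [row["columnname"] for row in df2.select("columnname").collect()]

    select_columns = list(set(df1.columns).intersection(df2_cols))
    new_df = df1.select(*select_columns)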