Tags: python, apache-spark, pyspark

PySpark drop columns based on column names / String condition


I want to drop columns in a PySpark DataFrame that contain any of the words in the banned_columns list, and build a new DataFrame from the remaining columns.

banned_columns = ["basket","cricket","ball"]
drop_these = [columns_to_drop for columns_to_drop in df.columns if columns_to_drop in banned_columns]

df_new = df.drop(*drop_these)

The idea of banned_columns is to drop any columns that start with basket or cricket, and any columns that contain the word ball anywhere in their name.

The above is what I have tried so far, but it does not work (the new DataFrame still contains those column names).

Example of the DataFrame's column names:

sports1basketjump | sports

In the above example, the column sports1basketjump should be dropped because its name contains the word basket.

Moreover, does using the filter and/or reduce functions offer any optimization over building a list with a for loop?


Solution

  • Your list comprehension does not do what you expect it to do: it only keeps a column name when it exactly matches one of the banned strings, so it returns an empty list unless there is an exact match. For how to match a list of substrings against a list of strings, check out matching list of substrings to a list of strings in Python; a minimal sketch is also shown after this list.

    Once drop_these contains the right names, df.drop(*cols) will work as you expect.
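
Here is a minimal sketch of the substring approach; the example DataFrame and its column names are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data; the column names are assumptions for illustration.
df = spark.createDataFrame(
    [(1, 2, 3)],
    ["sports1basketjump", "sports", "footballcount"],
)

banned_columns = ["basket", "cricket", "ball"]

# Substring match: flag a column for dropping if ANY banned word
# appears anywhere in its name.
drop_these = [c for c in df.columns if any(word in c for word in banned_columns)]

df_new = df.drop(*drop_these)  # drops sports1basketjump and footballcount
print(df_new.columns)          # ['sports']

The key change from your attempt is the any(word in c for word in banned_columns) test, which checks for substrings instead of testing whole column names for membership in the list.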