Tags: python, apache-spark, pyspark

PySpark drop columns based on column names / String condition


I want to drop columns in a PySpark DataFrame that contain any of the words in the banned_columns list, and build a new DataFrame from the remaining columns.

banned_columns = ["basket","cricket","ball"]
drop_these = [columns_to_drop for columns_to_drop in df.columns if columns_to_drop in banned_columns]

df_new = df.drop(*drop_these)

The idea of banned_columns is to drop any columns that start with basket or cricket, and any columns that contain the word ball anywhere in their name.

The above is what I have tried so far, but it does not work (the new DataFrame still contains those column names).

Example of the DataFrame's column names:

sports1basketjump | sports

In the above example, the column sports1basketjump should be dropped because its name contains the word basket.

Moreover, does using the filter and/or reduce functions offer any optimization over building a list with a for loop?


Solution

  • Your list comprehension does not do what you expect it to do: it only keeps a column name when it exactly matches one of the banned strings, so it returns an empty list unless there is an exact match. For how to match a list of substrings against a list of strings, check out matching list of substrings to a list of strings in Python; a minimal sketch is also shown after this list.

    Once drop_these contains the right names, df.drop(*cols) will work as you expect.
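
Here is a minimal sketch of the substring approach; the example DataFrame and its column names are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data; the column names are assumptions for illustration.
df = spark.createDataFrame(
    [(1, 2, 3)],
    ["sports1basketjump", "sports", "footballcount"],
)

banned_columns = ["basket", "cricket", "ball"]

# Substring match: flag a column for dropping if ANY banned word
# appears anywhere in its name.
drop_these = [c for c in df.columns if any(word in c for word in banned_columns)]

df_new = df.drop(*drop_these)  # drops sports1basketjump and footballcount
print(df_new.columns)          # ['sports']

The key change from your attempt is the any(word in c for word in banned_columns) test, which checks for substrings instead of testing whole column names for membership in the list.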