I want to drop the columns in a PySpark dataframe whose names contain any of the words in the banned_columns list, and form a new dataframe from the remaining columns.
banned_columns = ["basket","cricket","ball"]
drop_these = [columns_to_drop for columns_to_drop in df.columns if columns_to_drop in banned_columns]
df_new = df.drop(*drop_these)
The idea of banned_columns is to drop any columns that start with basket or cricket, and any columns that contain the word ball anywhere in their name.
The above is what I have tried so far, but it does not work (the new dataframe still contains those columns).
Example of dataframe columns:
sports1basketjump | sports
In the example above, the column sports1basketjump should be dropped because its name contains the word basket.
Moreover, does using the filter and/or reduce functions offer any optimization over building a list with a for loop?
Your list comprehension does not do what you expect it to do: the in test checks for exact equality against the banned_columns entries, so it will return an empty list unless a column name exactly matches one of the banned strings. For an answer on how to match a list of substrings against a list of strings, see matching list of substrings to a list of strings in Python.
The df.drop(*cols) call itself will work as you expect.
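As a minimal sketch of the substring-based fix (the column names here are hypothetical, based on the example in the question), you can test each column name against every banned word with any():

```python
banned_columns = ["basket", "cricket", "ball"]

# Hypothetical column names, modeled on the question's example.
columns = ["sports1basketjump", "sports", "cricket_score", "total"]

# Keep a column for dropping if ANY banned word appears anywhere in its name.
drop_these = [col for col in columns
              if any(banned in col for banned in banned_columns)]
print(drop_these)  # ['sports1basketjump', 'cricket_score']

# An equivalent version using filter(), as asked about in the question:
drop_filter = list(filter(lambda col: any(banned in col
                                          for banned in banned_columns),
                          columns))
print(drop_filter)  # same result as drop_these

# With a real PySpark DataFrame `df`, the drop itself is unchanged:
# df_new = df.drop(*drop_these)
```

Note that there is no meaningful performance difference between the comprehension and the filter() version here: df.columns is a small driver-side Python list, not distributed data, so this loop never touches Spark's execution engine.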