I have this dataframe which contains user id and tags related to the user. what is the best way to filter out only those rows which has tags containing any one among this list. data_science = ['python', 'r', 'matlab', 'sas', 'excel', 'sql'] I have tried the below code in pandas, it does filter out to a certain extent but it gives tags which has any tag similar to the list. for example, for sql, it throws sql-server. Could you suggest a better approach?
df_ds = df_combo[df_combo["Tag"].astype(str).str.contains('(python|excel|sql|matlab)', regex=True)]
I think a possible simpler approach, but perhaps verbose is this:
# create a set with the queried tags
tags = {'python', 'r', 'matlab', 'sas', 'excel', 'sql'}
# create an auxiliary column where all the tags are separated elements of a set
df_combo['Tag-set'] = df_combo['Tag'].str.split(',').apply(lambda x: [e.strip() for e in x]).tolist()
# use sets to check the intersection
df_combo['Tag-set'] = df_combo['Tag-set'].apply(set)
# filter the list
df_fd = df_combo[df_combo['Tag-set'].apply(lambda x: len(x & tags) > 0)]
The idea is to clean all the strings using split
and strip
, and then keep only those where the intersection has at least one element.