Search code examples
pythonpandasdataframedata-analysiseda

How do I filter only those rows which contains any of the values from a given list of tags


DataFrame

I have this dataframe which contains user id and tags related to the user. what is the best way to filter out only those rows which has tags containing any one among this list. data_science = ['python', 'r', 'matlab', 'sas', 'excel', 'sql'] I have tried the below code in pandas, it does filter out to a certain extent but it gives tags which has any tag similar to the list. for example, for sql, it throws sql-server. Could you suggest a better approach?

df_ds = df_combo[df_combo["Tag"].astype(str).str.contains('(python|excel|sql|matlab)', regex=True)]

Solution

  • I think a possible simpler approach, but perhaps verbose is this:

    # create a set with the queried tags
    tags = {'python', 'r', 'matlab', 'sas', 'excel', 'sql'}
    # create an auxiliary column where all the tags are separated elements of a set 
    df_combo['Tag-set'] = df_combo['Tag'].str.split(',').apply(lambda x: [e.strip() for e in x]).tolist() 
    # use sets to check the intersection
    df_combo['Tag-set'] = df_combo['Tag-set'].apply(set)
    # filter the list
    df_fd = df_combo[df_combo['Tag-set'].apply(lambda x: len(x & tags) > 0)]
    

    The idea is to clean all the strings using split and strip, and then keep only those where the intersection has at least one element.