Search code examples
pandassubstringrowscontainssimilarity

Is there a way to keep only rows in a DataFrame, when a column of that dataframe contains a substring of another column in that dataframe?


I have a dataset:

                                  id                                   key            value
24                Apple Inc_Desktops    revenue_rgs_category_-_pc_monitors              nan
2                 Apple Inc_Desktops  revenue_rgs_category_-_mobile_phones 142381000000.000
46                Apple Inc_Desktops     revenue_rgs_category_-_smart_tech  24482000000.000
13                Apple Inc_Desktops    revenue_rgs_category_-_desktop_pcs  12870000000.000
35                Apple Inc_Desktops        revenue_rgs_category_-_tablets  21280000000.000
1                  Apple Inc_Laptops  revenue_rgs_category_-_mobile_phones 142381000000.000
45                 Apple Inc_Laptops     revenue_rgs_category_-_smart_tech  24482000000.000
23                 Apple Inc_Laptops    revenue_rgs_category_-_pc_monitors              nan
34                 Apple Inc_Laptops        revenue_rgs_category_-_tablets  21280000000.000
12                 Apple Inc_Laptops    revenue_rgs_category_-_desktop_pcs  12870000000.000
25            Apple Inc_MobilePhones    revenue_rgs_category_-_pc_monitors              nan
14            Apple Inc_MobilePhones    revenue_rgs_category_-_desktop_pcs  12870000000.000
36            Apple Inc_MobilePhones        revenue_rgs_category_-_tablets  21280000000.000
47            Apple Inc_MobilePhones     revenue_rgs_category_-_smart_tech  24482000000.000
3             Apple Inc_MobilePhones  revenue_rgs_category_-_mobile_phones 142381000000.000

And I only want to keep the rows when the column key contains a substring from column id. For example, as illustrated in the picture below, i want to keep only rows with index 13,3 because for those rows the 'key' column contains part of the id column - eg, for row with index 3, 'Mobile' is included in key column.

So my desired output would be:

                                  id                                   key            value
13                Apple Inc_Desktops    revenue_rgs_category_-_desktop_pcs  12870000000.000
3             Apple Inc_MobilePhones  revenue_rgs_category_-_mobile_phones 142381000000.000

I tried to create a new indicating whether the 'key' column contains substring of the 'id' column, but with not luck:

comp_rev_long['check'] = comp_rev_long['key'].str.contains('|'.join(comp_rev_long['id']),case=False)

Any ideas on an efficient way to do this? Thanking you in advance.


Solution

  • Here is some code that should help you get started:

    import numpy as np
    import pandas as pd
    
    np.random.seed(1)
    # I create a simple DataFrame
    df = pd.DataFrame({"id": np.random.choice(["apple", "banana", "cherry"], 15),
                       "key":  np.random.choice(["apple pie", "banana pie", "cherry pie"], 15),
                       "value":    np.random.randint(0,20, 15)})
    

    df looks like this:

            id         key  value
    0   banana  cherry pie     13
    1    apple  banana pie      9
    2    apple  cherry pie      9
    3   banana   apple pie      7
    4   banana   apple pie      1
    5    apple  cherry pie      0
    6    apple   apple pie     17
    7   banana  banana pie      8
    8    apple  cherry pie     13
    9   banana  cherry pie     19
    10   apple   apple pie     15
    11  cherry  banana pie     10
    12  banana  banana pie      8
    13  cherry  cherry pie      7
    14   apple   apple pie      3
    

    Here is a simple option to select only the rows that satisfy a certain condition.

    # create a function that checks if a row satisfies your condition
    check_condition = lambda row: row["id"] in row["key"]
    
    # create a new column that determines whether you keep the row
    # by applying the check_condition function row wise (-> axis=1)
    df["keep_row"] =  df.apply(check_condition, axis=1)
    
    # finally select and keep only the desired rows 
    df = df[df["keep_row"]]
    

    Now df looks like this:

            id         key  value  keep_row
    6    apple   apple pie     17      True
    7   banana  banana pie      8      True
    10   apple   apple pie     15      True
    12  banana  banana pie      8      True
    13  cherry  cherry pie      7      True
    14   apple   apple pie      3      True
    

    One final issue is how to check if a substring is contained in another string. There are a few ways to go about this.

    • Replace the values such that this operation becomes trivial, eg. row["id"] in row["key"]
    • Make new columns with the crucial information of the sting, if you only need to know whether is a mobile or pc make a new 'device' column.
    • Just code it anyway, this a a bit cumbersome though

    This check_condition might work, form seeing your data but I cannot be sure of course.

    def check_condition(row):
        for i in row["id"].lower().split('_'):
            if i in row["key"].lower():
                return True
            elif i[:-1] in row["key"].lower(): # account for the final 's'
                return True
        return False
    

    2 notes:

    1. This isn't a lambda function, but in this case it is equivalent to one, so you can replace the lambda check_condition-function by this one.
    2. Also note that in the "id" and "key" columns some words ends with '-s' and some don't so that needs to be accounted for as well.