pandas substring rows contains similarity

Is there a way to keep only rows in a DataFrame, when a column of that dataframe contains a substring of another column in that dataframe?

I have a dataset:

                                  id                                   key            value
24                Apple Inc_Desktops    revenue_rgs_category_-_pc_monitors              nan
2                 Apple Inc_Desktops  revenue_rgs_category_-_mobile_phones 142381000000.000
46                Apple Inc_Desktops     revenue_rgs_category_-_smart_tech  24482000000.000
13                Apple Inc_Desktops    revenue_rgs_category_-_desktop_pcs  12870000000.000
35                Apple Inc_Desktops        revenue_rgs_category_-_tablets  21280000000.000
1                  Apple Inc_Laptops  revenue_rgs_category_-_mobile_phones 142381000000.000
45                 Apple Inc_Laptops     revenue_rgs_category_-_smart_tech  24482000000.000
23                 Apple Inc_Laptops    revenue_rgs_category_-_pc_monitors              nan
34                 Apple Inc_Laptops        revenue_rgs_category_-_tablets  21280000000.000
12                 Apple Inc_Laptops    revenue_rgs_category_-_desktop_pcs  12870000000.000
25            Apple Inc_MobilePhones    revenue_rgs_category_-_pc_monitors              nan
14            Apple Inc_MobilePhones    revenue_rgs_category_-_desktop_pcs  12870000000.000
36            Apple Inc_MobilePhones        revenue_rgs_category_-_tablets  21280000000.000
47            Apple Inc_MobilePhones     revenue_rgs_category_-_smart_tech  24482000000.000
3             Apple Inc_MobilePhones  revenue_rgs_category_-_mobile_phones 142381000000.000

And I only want to keep the rows when the column key contains a substring from column id. For example, as illustrated in the picture below, i want to keep only rows with index 13,3 because for those rows the 'key' column contains part of the id column - eg, for row with index 3, 'Mobile' is included in key column.

So my desired output would be:

                                  id                                   key            value
13                Apple Inc_Desktops    revenue_rgs_category_-_desktop_pcs  12870000000.000
3             Apple Inc_MobilePhones  revenue_rgs_category_-_mobile_phones 142381000000.000

I tried to create a new indicating whether the 'key' column contains substring of the 'id' column, but with not luck:

comp_rev_long['check'] = comp_rev_long['key'].str.contains('|'.join(comp_rev_long['id']),case=False)

Any ideas on an efficient way to do this? Thanking you in advance.

Solution

Here is some code that should help you get started:

import numpy as np
import pandas as pd

np.random.seed(1)
# I create a simple DataFrame
df = pd.DataFrame({"id": np.random.choice(["apple", "banana", "cherry"], 15),
                   "key":  np.random.choice(["apple pie", "banana pie", "cherry pie"], 15),
                   "value":    np.random.randint(0,20, 15)})

df looks like this:

        id         key  value
0   banana  cherry pie     13
1    apple  banana pie      9
2    apple  cherry pie      9
3   banana   apple pie      7
4   banana   apple pie      1
5    apple  cherry pie      0
6    apple   apple pie     17
7   banana  banana pie      8
8    apple  cherry pie     13
9   banana  cherry pie     19
10   apple   apple pie     15
11  cherry  banana pie     10
12  banana  banana pie      8
13  cherry  cherry pie      7
14   apple   apple pie      3

Here is a simple option to select only the rows that satisfy a certain condition.

# create a function that checks if a row satisfies your condition
check_condition = lambda row: row["id"] in row["key"]

# create a new column that determines whether you keep the row
# by applying the check_condition function row wise (-> axis=1)
df["keep_row"] =  df.apply(check_condition, axis=1)

# finally select and keep only the desired rows 
df = df[df["keep_row"]]

Now df looks like this:

        id         key  value  keep_row
6    apple   apple pie     17      True
7   banana  banana pie      8      True
10   apple   apple pie     15      True
12  banana  banana pie      8      True
13  cherry  cherry pie      7      True
14   apple   apple pie      3      True

One final issue is how to check if a substring is contained in another string. There are a few ways to go about this.

Replace the values such that this operation becomes trivial, eg. row["id"] in row["key"]
Make new columns with the crucial information of the sting, if you only need to know whether is a mobile or pc make a new 'device' column.
Just code it anyway, this a a bit cumbersome though

This check_condition might work, form seeing your data but I cannot be sure of course.

def check_condition(row):
    for i in row["id"].lower().split('_'):
        if i in row["key"].lower():
            return True
        elif i[:-1] in row["key"].lower(): # account for the final 's'
            return True
    return False

2 notes:

This isn't a lambda function, but in this case it is equivalent to one, so you can replace the lambda check_condition-function by this one.
Also note that in the "id" and "key" columns some words ends with '-s' and some don't so that needs to be accounted for as well.