I have a dataset:
id key value
24 Apple Inc_Desktops revenue_rgs_category_-_pc_monitors nan
2 Apple Inc_Desktops revenue_rgs_category_-_mobile_phones 142381000000.000
46 Apple Inc_Desktops revenue_rgs_category_-_smart_tech 24482000000.000
13 Apple Inc_Desktops revenue_rgs_category_-_desktop_pcs 12870000000.000
35 Apple Inc_Desktops revenue_rgs_category_-_tablets 21280000000.000
1 Apple Inc_Laptops revenue_rgs_category_-_mobile_phones 142381000000.000
45 Apple Inc_Laptops revenue_rgs_category_-_smart_tech 24482000000.000
23 Apple Inc_Laptops revenue_rgs_category_-_pc_monitors nan
34 Apple Inc_Laptops revenue_rgs_category_-_tablets 21280000000.000
12 Apple Inc_Laptops revenue_rgs_category_-_desktop_pcs 12870000000.000
25 Apple Inc_MobilePhones revenue_rgs_category_-_pc_monitors nan
14 Apple Inc_MobilePhones revenue_rgs_category_-_desktop_pcs 12870000000.000
36 Apple Inc_MobilePhones revenue_rgs_category_-_tablets 21280000000.000
47 Apple Inc_MobilePhones revenue_rgs_category_-_smart_tech 24482000000.000
3 Apple Inc_MobilePhones revenue_rgs_category_-_mobile_phones 142381000000.000
And I only want to keep the rows when the column key
contains a substring from column id
. For example, as illustrated in the picture below, i want to keep only rows with index 13,3 because for those rows the 'key' column contains part of the id
column - eg, for row with index
3, 'Mobile' is included in key
column.
So my desired output would be:
id key value
13 Apple Inc_Desktops revenue_rgs_category_-_desktop_pcs 12870000000.000
3 Apple Inc_MobilePhones revenue_rgs_category_-_mobile_phones 142381000000.000
I tried to create a new indicating whether the 'key' column contains substring of the 'id' column, but with not luck:
comp_rev_long['check'] = comp_rev_long['key'].str.contains('|'.join(comp_rev_long['id']),case=False)
Any ideas on an efficient way to do this? Thanking you in advance.
Here is some code that should help you get started:
import numpy as np
import pandas as pd
np.random.seed(1)
# I create a simple DataFrame
df = pd.DataFrame({"id": np.random.choice(["apple", "banana", "cherry"], 15),
"key": np.random.choice(["apple pie", "banana pie", "cherry pie"], 15),
"value": np.random.randint(0,20, 15)})
df
looks like this:
id key value
0 banana cherry pie 13
1 apple banana pie 9
2 apple cherry pie 9
3 banana apple pie 7
4 banana apple pie 1
5 apple cherry pie 0
6 apple apple pie 17
7 banana banana pie 8
8 apple cherry pie 13
9 banana cherry pie 19
10 apple apple pie 15
11 cherry banana pie 10
12 banana banana pie 8
13 cherry cherry pie 7
14 apple apple pie 3
Here is a simple option to select only the rows that satisfy a certain condition.
# create a function that checks if a row satisfies your condition
check_condition = lambda row: row["id"] in row["key"]
# create a new column that determines whether you keep the row
# by applying the check_condition function row wise (-> axis=1)
df["keep_row"] = df.apply(check_condition, axis=1)
# finally select and keep only the desired rows
df = df[df["keep_row"]]
Now df
looks like this:
id key value keep_row
6 apple apple pie 17 True
7 banana banana pie 8 True
10 apple apple pie 15 True
12 banana banana pie 8 True
13 cherry cherry pie 7 True
14 apple apple pie 3 True
One final issue is how to check if a substring is contained in another string. There are a few ways to go about this.
row["id"] in row["key"]
mobile
or pc
make a new 'device' column.This check_condition
might work, form seeing your data but I cannot be sure of course.
def check_condition(row):
for i in row["id"].lower().split('_'):
if i in row["key"].lower():
return True
elif i[:-1] in row["key"].lower(): # account for the final 's'
return True
return False
2 notes:
check_condition
-function by this one.