Consider having two dataframes having the same column a
. However in the first dataframe column a
has unique values, whereas in the second one it does not although the possible values of column are the same in both dataframes and both of them are sorted on a
.
df = pd.DataFrame([{'a': 1, 'b':2}, {'a':2, 'b': 4}, {'a':3, 'b': 4}])
df2 = pd.DataFrame([{'a': 1, 'c':2}, {'a': 1, 'c':3}, {'a':2, 'c': 4}, {'a': 2, 'c':5}, {'a': 3, 'c':5}, {'a': 3, 'c':5}])
From the first dataframe I want to consider only a given set of rows, based on column b
. For example:
my_indexes = df1[df1['b']==4]
the result my_indexes
is a pd.Series
of booleans (not exactly indexes but let me keep this naming), and normally in a dataframe it can be used as df[my_indexes]
to extract some rows. I want to use it on df2
to extract the rows with values a
that corresponds to that for which 'b'==4
, i.e., the ones for which valid
entries are True. However, df2
has more rows than df
. For this reason I first group on a
as:
grouped = df2.groupby("a")
in this way grouped
will have the same length of my_indexes
, since all values of column a
in df
are also in df2
and since both are sorted on a
. So I thought I could use valid
directly on groupded
as it were a df, but grouped[my_indexes]
does not work.
How can I use my_indexes
to filter df2
?
First filter df2
based on values in df['a']
where the condition df['a'] == 1
is satisfied and then perform the groupby operation on the filtered df. Basically:
import pandas as pd
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 2, 'b': 4}])
df2 = pd.DataFrame([{'a': 1, 'c': 2}, {'a': 1, 'c': 3}, {'a': 2, 'c': 4}, {'a': 2, 'c': 5}])
values_of_interest = df[df['a'] == 1]['a']
filtered_df2 = df2[df2['a'].isin(values_of_interest)]
grouped_df = filtered_df2.groupby('a')
AS an example:
for name, group in grouped_df:
print(f'Group: {name}')
print(group)
Which gives
Group: 1
a c
0 1 2
1 1 3