Search code examples
pythonpandasdataframeindexingmulti-index

How can I get the index values in DF1 to where DF1's column values match DF2's custom multiindex values?


I have two data frames: DF1 and DF2.
DF2 is essentially a randomly generated subset of rows in DF1. I want to get the (integer) indexes of DF1 of the rows where there is a complete match of all column values in DF1.
I'm trying to do this with a multi-index: So if I have the following:

DF1:

Index  Name  Age   Gender  Label

0      Kate   24     F     1

1      Bill   23     M     0

2      Bob    22     M     0

3      Billy  21     M     0

DF2:

MultiIndex      Name  Age   Gender   Label

(Bob,22,M)      Bob    22     M      0

(Billy,21,M)    Billy  21     M      0

Desired Output: [2,3]

How can I use that MultiIndex in DF2 to check DF1 for those matches? I found this while searching but I think this requires you to specify what value you want beforehand? I can't find this exact use case.

df2.loc[(df2.index.get_level_values("Name" =='xxx') & 
        (df2.index.get_level_values('Age') == x & 
        (df2.index.get_level_values('Gender') == x)]

Please let me know the best way. Thanks!

Edit (Code to generate df1): Pseudocode: Merge two dataframes to get a total of 10 columns and drop everything except 4 columns Edit (Code to generate df2):

if amount_needed - len(lowest_value_keys) > 0: 
    extra_samples = df1[df1.Label==0].sample(n=amount_needed -len(lowest_value_keys) ,replace=False)
    lowest_value_df = pd.DataFrame(data = lower_value_keys, columns = ["Name", 'Age','Gender'])
    samples = pd.concat([lowest_value_df, extra_samples])
    samples.index = pd.MultiIndex.from_frame(samples [["Name", 'Age','Gender']])

else:  
    all_samples  = pd.DataFrame(data = lower_value_keys, columns = ["Name", 'Age','Gender']) 
    samples = all_samples.sample(n=amount_needed,replace=False)
    samples.index = pd.MultiIndex.from_frame(samples [["Name", 'Age','Gender']])

Solution

  • Not sure if this answers your query, but if we first reset the index of df1 to get that as another column 'Index', and then set_index on Name, Age , Gender to find the matches on df2 and just take the resulting Index column would that work ?

    So that would be:

    df1.reset_index().set_index(['Name','Age','Gender']).loc[df2.set_index(['Name','Age','Gender']).index]['Index'].values