Search code examples
pandasdataframemulti-index

How can I select rows from one DataFrame, where a part of the row's index is in another DataFrame's index and meets certain criteria?


I have two DataFrames. df provides a lot of data. test_df describes whether certain tests have passed or not. I need to select from df only the rows where the tests have not failed by looking up this info in test_df. So far, I'm able to reduce my test_df to passed_tests. So, what's left is to select only the rows from df where the relevant part of the row index is in passed_tests. How can I do that?

Updates:

  • test_db doesn't haven't unique rows. Where there are duplicate rows (and there may be more than 1 duplicate), the test that was the most positive takes priority. i.e True > Ok > False.

My code:

import pandas as pd
import numpy as np


index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']), np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
data = np.array(['False', 'True', 'False', 'False', 'False', 'Ok', 'False'])
columns = ["Passed?"]
test_df = pd.DataFrame(data, index=index, columns=columns)
print test_df

index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux', 'qux']),
         np.array(['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']),
         np.array(['1', '2', '1', '2', '1', '2', '1', '2'])]
data = np.random.randn(8, 2)
columns = ["X", "Y"]
df = pd.DataFrame(data, index=index, columns=columns)
print df

passed_tests = test_df.loc[test_df['Passed?'].isin(['True', 'Ok'])]
print passed_tests

df

                X         Y
foo a 1  0.589776 -0.234717
      2  0.105161  1.937174
    b 1 -0.092252  0.143451
      2  0.939052 -0.239052
qux a 1  0.757239  2.836032
      2 -0.445335  1.352374
    b 1  2.175553 -0.700816
      2  1.082709 -0.923095

test_df

     Passed?
foo a   False
    a    True
    b   False
    b   False
qux a   False
    b      Ok
    b   False

passed_tests

      Passed?
foo a    True
qux b      Ok

required solution

                X         Y
foo a 1  0.589776 -0.234717
      2  0.105161  1.937174
qux b 1  2.175553 -0.700816
      2  1.082709 -0.923095

Solution

  • You need reindex with method='ffill', then check values by isin and last use boolean indexing:

    print (test_df.reindex(df.index, method='ffill'))
            Passed?
    foo a 1    True
          2    True
        b 1   False
          2   False
    qux a 1   False
          2   False
        b 1      Ok
          2      Ok
    
    mask = test_df.reindex(df.index, method='ffill').isin(['True', 'Ok'])['Passed?']
    print (mask)
    foo  a  1     True
            2     True
         b  1    False
            2    False
    qux  a  1    False
            2    False
         b  1     True
            2     True
    Name: Passed?, dtype: bool
    
    print (df[mask])
                    X         Y
    foo a 1 -0.580448 -0.168951
          2 -0.875165  1.304745
    qux b 1 -0.147014 -0.787483
          2  0.188989 -1.159533
    

    EDIT:

    For remove duplicates here is the easier use:

    test_df = test_df.reset_index()
                     .sort_values(['level_0','level_1', 'Passed?'], ascending=[1,1,0])
                     .drop_duplicates(['level_0','level_1'])
                     .set_index(['level_0','level_1'])
                     .rename_axis([None, None])
    print (test_df)
          Passed?
    foo a    True
        b   False
    qux a   False
        b      Ok
    

    Another solution is simplier - sorting first and then groupby with first:

    test_df = test_df.sort_values('Passed?', ascending=False)
                     .groupby(level=[0,1])
                     .first()
    print (test_df)
          Passed?
    foo a    True
        b   False
    qux a   False
        b      Ok
    

    EDIT1:

    Convert values to ordered Categorical.

    index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']), np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
    data = np.array(['False', 'True', 'False', 'False', 'False', 'Acceptable', 'False'])
    columns = ["Passed?"]
    test_df = pd.DataFrame(data, index=index, columns=columns)
    #print (test_df)
    
    cat = ['False', 'Acceptable','True']
    test_df["Passed?"] = test_df["Passed?"].astype('category', categories=cat, ordered=True)
    print (test_df["Passed?"])
    foo  a         False
         a          True
         b         False
         b         False
    qux  a         False
         b    Acceptable
         b         False
    Name: Passed?, dtype: category
    Categories (3, object): [False < Acceptable < True]
    
    test_df = test_df.sort_values('Passed?', ascending=False).groupby(level=[0,1]).first()
    print (test_df)
              Passed?
    foo a        True
        b       False
    qux a       False
        b  Acceptable