Search code examples
pythonpandasdataframeexceptionkeyerror

pandas DataFrame KeyError: get list of missing keys


I am trying to select from the data frame df the rows that match a list of keys IDs.

df.loc[IDs]

In rare cases they keys are not contained in the DataFrame and a KeyError is thrown:

KeyError: "None of [['001U000001c6OczIAE' '001U000000fgVR9IAM' '0015800000ecNcjAAE'\n '001U000000fgVRDIA2']] are in the [index]"

Can I easily get to the list of missing keys from that exception? Would it be better/cleaner/more pythonic to not do this by exception handling?


Solution

  • Without more info it looks like you may have forgotten a step- have you set your index to the Salesforce ID (assuming that's what it is)?

    For example (random data):

    df
        a account
    0   1     abc
    1   3     abc
    2   5     abc
    3   7     def
    4   7     def
    5  34      gf
    6   3      hj
    7  24      hj
    
    lis = ['abc', 'hj']
    df.loc[lis]
    KeyError: "None of [['abc', 'hj']] are in the [index]"
    

    For pandas versions prior to 0.21.0

    After setting index:

    df.set_index('account').loc[lis]
              a
    account    
    abc       1
    abc       3
    abc       5
    hj        3
    hj       24
    

    Missing values should not throw an error, rather np.nan value:

    lis = ['abc', 'hj', 'j']
    df.set_index('account').loc[lis]
                a
    account      
    abc       1.0
    abc       3.0
    abc       5.0
    hj        3.0
    hj       24.0
    j         NaN
    

    For pandas versions 0.21.0+

    You will need to instead use the dataframe method reindex(). With reindex, however, you will not be able to have duplicates in the index (so my example above will not work without some deduplication):

    df.set_index('account')\
    .groupby(level=0).first()\  # de-duplicate index here
    .reindex(lis2)
    
               a
    account     
    abc      1.0
    hj       3.0
    j        NaN
    

    Unfortunately using reindex() after deduplicating there is data loss if you're trying to continue working with the dataframe. Regardless, this is probably not the best way to find identifiers missing from your dataframe.