
Get values from dataframe with MultiIndex index containing NaNs


I cannot access the values at an index position that contains a NaN, and I wonder how I could solve this. (In my project this index has a very special meaning and I really need to keep it; otherwise I would have to make some dirty manual modifications. "There is always a solution", even if it is a very bad one.)

df
Out
temp_playlist  objId
0              o1           [0, 6]
               o2           [1, 4]
               o3           [2, 5]
               o4       [8, 9, 12]
               o5         [10, 13]
               o6         [11, 14]
               NaN          [3, 7]
Name: x, dtype: object

df.index
Out
MultiIndex([(0, 'o1'),
            (0, 'o2'),
            (0, 'o3'),
            (0, 'o4'),
            (0, 'o5'),
            (0, 'o6'),
            (0,  nan)],
           names=['temp_playlist', 'objId'])

Now I want to access the [3, 7] values with df.loc[(0, np.nan)], but I get KeyError: (0, nan).

Just to put it in perspective: [df.loc[idx] for idx in df.index if not pd.isna(idx[1])] works properly because I am skipping the problematic index.

What am I missing and how could I solve this?

(Windows 10, python 3.8.5, pandas 1.3.1, numpy 1.20.3, reported to pandas here)


Solution

  • Update

    I am able to reproduce your error after grouping and aggregating a data frame.

    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...     "temp_playlist": [0] * 15,
    ...     "objId": ['o1'] * 2 + ['o2'] * 2 + ['o3'] * 2 + ['o4'] * 3 + ['o5'] * 2 + ['o6'] * 2 + [pd.NA] * 2,
    ...     "vals": [0, 6, 1, 4, 2, 5, 8, 9, 12, 10, 13, 11, 14, 3, 7]
    ... })
    >>> df = data.groupby(["temp_playlist", "objId"], dropna=False).agg(list)
    >>> df.loc[(0, pd.NA)]
    Traceback (most recent call last):
      File "/home/ec2-user/miniconda3/envs/so-pandas-nan-index/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
        return self._engine.get_loc(casted_key)
      File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: <NA>
    

    Passing in an explicit MultiIndex works, though.

    >>> df.loc[pd.MultiIndex.from_tuples([(0, pd.NA)], names=["temp_playlist", "objId"])]
                           vals
    temp_playlist objId
    0             NaN    [3, 7]
    
    >>> df.loc[pd.MultiIndex.from_tuples([(0, pd.NA)])]
             vals
    0 NaN  [3, 7]
    

    So does passing a list that contains a single tuple. Note that the double brackets [[...]] make .loc return a DataFrame rather than a Series.

    >>> df.loc[[(0, pd.NA)]]
                           vals
    temp_playlist objId
    0             NaN    [3, 7]
    

    As does DataFrame.reindex (see also the user guide on reindexing).

    >>> df.reindex([(0, pd.NA)])
                           vals
    temp_playlist objId
    0             NaN    [3, 7]
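
    If you would rather not depend on how .loc resolves a missing-value key, a boolean mask built from the index levels is another option. The following is a minimal sketch rather than anything confirmed by the pandas issue: it rebuilds a smaller version of the grouped frame above (using np.nan as the missing key, which is my assumption about the data), and the helper name rows_with_missing_objid is hypothetical.

```python
import numpy as np
import pandas as pd

# Smaller version of the grouped frame from the reproduction above.
# Assumption: the missing objId is np.nan; isna() would flag pd.NA or None as well.
data = pd.DataFrame({
    "temp_playlist": [0] * 6,
    "objId": ["o1", "o1", "o2", "o2", np.nan, np.nan],
    "vals": [0, 6, 1, 4, 3, 7],
})
df = data.groupby(["temp_playlist", "objId"], dropna=False).agg(list)


def rows_with_missing_objid(frame, playlist):
    """Hypothetical helper: rows for `playlist` whose objId label is missing."""
    playlists = frame.index.get_level_values("temp_playlist")
    obj_ids = frame.index.get_level_values("objId")
    # Combine an equality test on the first level with a missing-value test on the second.
    mask = (playlists == playlist) & obj_ids.isna()
    return frame[mask]


missing = rows_with_missing_objid(df, 0)
print(missing["vals"].tolist())
```

    Because the mask never hashes the missing value as a key, it sidesteps the KeyError entirely, at the cost of scanning the whole index.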
    

    Original Attempt to Reproduce Error

    Initially I was not able to reproduce your error. You can see below that df.loc[(0, np.nan)] works on an index built by hand.

    Python 3.8.5 (default, Sep  4 2020, 07:30:14)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> import pandas as pd
    >>> nan_index = pd.MultiIndex.from_tuples([(0, 'o1'),
    ...             (0, 'o2'),
    ...             (0, 'o3'),
    ...             (0, 'o4'),
    ...             (0, 'o5'),
    ...             (0, 'o6'),
    ...             (0,  np.nan)])
    >>> print(nan_index)
    MultiIndex([(0, 'o1'),
                (0, 'o2'),
                (0, 'o3'),
                (0, 'o4'),
                (0, 'o5'),
                (0, 'o6'),
                (0,  nan)],
               )
    >>> rng = np.random.default_rng(42)
    >>> vals = [rng.choice(20, 2) for i in range(nan_index.shape[0])]
    >>> print(vals)
    [array([ 1, 15]), array([13,  8]), array([ 8, 17]), array([ 1, 13]), array([4, 1]), array([10, 19]), array([14, 15])]
    >>> df = pd.DataFrame({"vals": vals}, index=nan_index)
    >>> print(df)
               vals
    0 o1    [1, 15]
      o2    [13, 8]
      o3    [8, 17]
      o4    [1, 13]
      o5     [4, 1]
      o6   [10, 19]
      NaN  [14, 15]
    >>> print(df.loc[(0, 'o1')])
    vals    [1, 15]
    Name: (0, o1), dtype: object
    >>> print(df.loc[(0, np.nan)])
    vals    [14, 15]
    Name: (0, nan), dtype: object
    >>> print(pd.__version__)
    1.3.1
    

    Then I noticed that your index prints the missing label as nan. I had built mine with np.nan, so I suspected yours might contain pd.NA instead and rebuilt the index with it.

    >>> nan_index = pd.MultiIndex.from_tuples([(0, 'o1'),
    ...             (0, 'o2'),
    ...             (0, 'o3'),
    ...             (0, 'o4'),
    ...             (0, 'o5'),
    ...             (0, 'o6'),
    ...             (0,  pd.NA)])
    >>> nan_index
    MultiIndex([(0, 'o1'),
                (0, 'o2'),
                (0, 'o3'),
                (0, 'o4'),
                (0, 'o5'),
                (0, 'o6'),
                (0,  nan)],
               )
    >>> df = pd.DataFrame({"vals": vals}, index=nan_index)
    >>> df
               vals
    0 o1    [1, 15]
      o2    [13, 8]
      o3    [8, 17]
      o4    [1, 13]
      o5     [4, 1]
      o6   [10, 19]
      NaN  [14, 15]
    

    However, that did not explain the difference. Both df.loc[(0, pd.NA)] and df.loc[(0, np.nan)] still worked.

    >>> df.loc[(0, pd.NA)]
    vals    [14, 15]
    Name: (0, nan), dtype: object
    
    >>> df.loc[(0, np.nan)]
    vals    [14, 15]
    Name: (0, nan), dtype: object
    

    Moreover, I was also able to use df.loc[(0, None)].

    >>> df.loc[(0, None)]
    vals    [14, 15]
    Name: (0, nan), dtype: object
    

    Just to confirm, np.nan, pd.NA, and None are all different objects, so pandas must be treating them as equivalent missing values when they are used as keys in DataFrame.loc on this index.

    >>> pd.NA is np.nan
    False
    
    >>> pd.NA is None
    False
    
    >>> np.nan is None
    False
    
    >>> type(pd.NA)
    <class 'pandas._libs.missing.NAType'>
    
    >>> type(np.nan)
    <class 'float'>
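
    A way to check this without relying on object identity is pd.isna, which reports all three sentinels as missing. The sketch below is my own illustration, not part of the original session; it also builds one small MultiIndex per sentinel and checks that the second level flags the label as missing in each case.

```python
import numpy as np
import pandas as pd

sentinels = {"np.nan": np.nan, "pd.NA": pd.NA, "None": None}

# pd.isna treats each sentinel as missing, although they are distinct objects.
flags = {name: bool(pd.isna(value)) for name, value in sentinels.items()}
print(flags)

# Build one small MultiIndex per sentinel and test the second level for missing labels.
level_is_missing = {}
for name, value in sentinels.items():
    index = pd.MultiIndex.from_tuples([(0, "o1"), (0, value)])
    level_is_missing[name] = index.get_level_values(1).isna().tolist()
print(level_is_missing)
```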