python python-3.x pandas dataframe data-analysis

A Lexicographical Bug in Pandas?


Please take this question lightly; I am asking out of curiosity.

While trying to understand how slicing works on a MultiIndex, I came across the following situation ↓

import numpy as np
import pandas as pd

# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)

Returns:

a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32

NOTE that the first-level labels are not in sorted order (a, c, b), so slicing should raise the error we expect.

# When we do slicing
data.loc["a":"c"]

Errors like:

UnsortedIndexError

----> 1 data.loc["a":"c"]
UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

That's expected. But now, after doing the following steps:

# Making a DataFrame
data = data.unstack()

# Reindexing - to unsort the indices like before
data = data.reindex(["a", "c", "b"])

# Which looks like 
   1  2
a  5  0
c  8  6
b  6  3

# Then again making series
data = data.stack()

# Reindex Again!
data = data.reindex(["a", "c", "b"], level=0)


# Which looks like before
a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32

The Problem

So, now the process is: Series → Unstack → DataFrame → Stack → Series

Now, if I do the same slicing as before (with the indices still unsorted), we don't get any error!

# The same slicing
data.loc["a":"c"]

Results without an error:

a  1    5
   2    0
c  1    8
   2    6
dtype: int32

data.index.is_monotonic is still False, so why can we slice?

So the question is: why?

I hope the situation is clear: the same Series that raised the error before no longer raises it after the unstack and stack operations.

So is that a bug, or a new concept that I am missing here?

Thanks!
Aayush ∞ Shah

UPDATE: I have added data.reindex() calls to unsort the index once more. Please have another look.


Solution

  • The difference between your two Series is the following:

    index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
    
    data = pd.Series(np.random.randint(10, size=6), index=index)
    
    data2 = data.unstack().reindex(["a", "c", "b"]).stack()
    
    >>> data.index.codes
    FrozenList([[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]])
    
    >>> data2.index.codes
    FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
    

    Even though your two indexes look the same (same values), their internal representations (codes) are different.

    The docstring of the relevant MultiIndex method describes this:

            Create a new MultiIndex from the current to monotonically sorted
            items IN the levels. This does not actually make the entire MultiIndex
            monotonic, JUST the levels.
    
            The resulting MultiIndex will have the same outward
            appearance, meaning the same .values and ordering. It will also
            be .equals() to the original.
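
To make the "same appearance, different codes" point concrete, here is a minimal sketch that builds both representations by hand with the `pd.MultiIndex(levels=..., codes=...)` constructor. The names `by_sorted_levels`, `by_appearance` and `can_slice` are made up for this illustration, and the values are fixed instead of random so the result is reproducible.

```python
import pandas as pd

inner_codes = [0, 1, 0, 1, 0, 1]

# Same layout as the index built by from_product: sorted levels, unsorted codes.
by_sorted_levels = pd.MultiIndex(
    levels=[["a", "b", "c"], [1, 2]],
    codes=[[0, 0, 2, 2, 1, 1], inner_codes],
)

# Same layout as the index after unstack/reindex/stack: unsorted levels, sorted codes.
by_appearance = pd.MultiIndex(
    levels=[["a", "c", "b"], [1, 2]],
    codes=[[0, 0, 1, 1, 2, 2], inner_codes],
)


def can_slice(index):
    """Return True if .loc["a":"c"] works on a Series carrying this index."""
    series = pd.Series(range(len(index)), index=index)
    try:
        series.loc["a":"c"]
    except pd.errors.UnsortedIndexError:
        return False
    return True


# Both indexes hold the same tuples in the same order.
print(list(by_sorted_levels) == list(by_appearance))
print(by_sorted_levels.equals(by_appearance))

# Only the one whose codes are sorted accepts the slice.
print(can_slice(by_sorted_levels))
print(can_slice(by_appearance))
```

This suggests that the sortedness check behind `.loc` slicing looks at the integer codes rather than at the visible labels, which would explain why the round-tripped Series can be sliced although its labels are not in sorted order.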
    

    Old answer

    # Making a DataFrame
    data = data.unstack()
    
    # Which looks like         # <- WRONG
       1  2                    #    1  2
    a  5  0                    # a  8  0
    c  8  6                    # b  4  1
    b  6  3                    # c  7  6
    
    # Then again making series
    data = data.stack()
    
    # Which looks like before  # <- WRONG
    a  1    5                  # a  1    2
       2    0                  #    2    1
    c  1    8                  # b  1    0
       2    6                  #    2    1
    b  1    6                  # c  1    3
       2    3                  #    2    9
    dtype: int32
    

    If you want to use slicing, you have to check if the index is monotonic:

    # Simple MultiIndex Creation
    index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
    
    # Making Series with that MultiIndex
    data = pd.Series(np.random.randint(10, size=6), index=index)
    
    >>> data.index.is_monotonic
    False
    
    >>> data.unstack().stack().index.is_monotonic
    True
    
    >>> data.sort_index().index.is_monotonic
    True
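
As a sketch of the check above on a recent pandas: `Index.is_monotonic` was deprecated in pandas 1.5 and removed in 2.0, so the example below uses `is_monotonic_increasing` instead. The values are fixed (`range(6)`) rather than random, which is an assumption made only to keep the output reproducible.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
data = pd.Series(range(6), index=index)

# The freshly built index is not sorted, so label slicing would fail.
print(data.index.is_monotonic_increasing)

# Sorting the index first makes label slicing safe.
sorted_data = data.sort_index()
print(sorted_data.index.is_monotonic_increasing)

sliced = sorted_data.loc["a":"c"]
print(sliced)
```

Note that after `sort_index()` the slice `"a":"c"` also includes the `b` rows, because `b` now sits between `a` and `c`. That differs from the result in the question, where the slice on the round-tripped Series returned only `a` and `c`.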