Search code examples
python-3.xpandassortingmulti-indexlexicographic

MultiIndex DataFrame sorting order after columns stacking to Series


Problems during sort_index after DataFrame stack application:

I have an MS Excel singlesheet file which looks like: link to table screenshot

I'v read it in DataFrame and received an expected result:

df_test = pd.read_excel(io = path_test_xlsx, header = 0, index_col = 0) 
       US  AR   QA
Date                   
2015-01-01   1  10  100
2016-01-01   2  20  200
2017-01-01   3  30  300

Then I stacked it to Series like this:

ser_test = df_test.stack()
ser_test.index.names = ['Date', 'Country']
Date        Country
2015-01-01  US           1
            AR          10
            QA         100
2016-01-01  US           2
            AR          20
            QA         200
2017-01-01  US           3
            AR          30
            QA         300

Next step is to sort Series by Country and then by Date. So I tried this:

print(ser_test.sort_index(level = ['Country', 'Date']))
Date        Country
2015-01-01  US           1
2016-01-01  US           2
2017-01-01  US           3
2015-01-01  AR          10
2016-01-01  AR          20
2017-01-01  AR          30
2015-01-01  QA         100
2016-01-01  QA         200
2017-01-01  QA         300

Seems to me the problem is in stacking procedure, because next list of manipulations drove me to success:

df_test_reseted = ser_test.reset_index(level = 'Country')
ser_test_reseted = df_test_reseted.set_index('Country', append = True).squeeze()
print(ser_test_reseted.sort_index(level = ['Country', 'Date']))
Date        Country
2015-01-01  AR          10
2016-01-01  AR          20
2017-01-01  AR          30
2015-01-01  QA         100
2016-01-01  QA         200
2017-01-01  QA         300
2015-01-01  US           1
2016-01-01  US           2
2017-01-01  US           3

Is stack procedure really leads to ignoring lexicographical order or I've done something wrong?


Solution

  • It seems bug because sorting is by labels, not by levels. If recreate MultiIndex it working nice:

    print (ser_test.index.labels)
    [[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]]
    
    print (ser_test.index.levels)
    [['2015-01-01', '2016-01-01', '2017-01-01'], ['US', 'AR', 'QA']]
    
    ser_test.index = pd.MultiIndex.from_tuples(ser_test.index.tolist(), names=['Date', 'Country'])
    print(ser_test.sort_index(level = ['Country', 'Date']))
    Date        Country
    2015-01-01  AR          10
    2016-01-01  AR          20
    2017-01-01  AR          30
    2015-01-01  QA         100
    2016-01-01  QA         200
    2017-01-01  QA         300
    2015-01-01  US           1
    2016-01-01  US           2
    2017-01-01  US           3
    dtype: int64
    

    Another idea is use this trick - convert Series to one column DataFrame and select by 0 in last step:

    print(ser_test.to_frame().sort_index(level=['Country', 'Date'])[0])
    Date        Country
    2015-01-01  AR          10
    2016-01-01  AR          20
    2017-01-01  AR          30
    2015-01-01  QA         100
    2016-01-01  QA         200
    2017-01-01  QA         300
    2015-01-01  US           1
    2016-01-01  US           2
    2017-01-01  US           3
    Name: 0, dtype: int64