Problems during sort_index after DataFrame stack application:
I have an MS Excel singlesheet file which looks like: link to table screenshot
I'v read it in DataFrame and received an expected result:
df_test = pd.read_excel(io = path_test_xlsx, header = 0, index_col = 0)
US AR QA
Date
2015-01-01 1 10 100
2016-01-01 2 20 200
2017-01-01 3 30 300
Then I stacked it to Series like this:
ser_test = df_test.stack()
ser_test.index.names = ['Date', 'Country']
Date Country
2015-01-01 US 1
AR 10
QA 100
2016-01-01 US 2
AR 20
QA 200
2017-01-01 US 3
AR 30
QA 300
Next step is to sort Series by Country and then by Date. So I tried this:
print(ser_test.sort_index(level = ['Country', 'Date']))
Date Country
2015-01-01 US 1
2016-01-01 US 2
2017-01-01 US 3
2015-01-01 AR 10
2016-01-01 AR 20
2017-01-01 AR 30
2015-01-01 QA 100
2016-01-01 QA 200
2017-01-01 QA 300
Seems to me the problem is in stacking procedure, because next list of manipulations drove me to success:
df_test_reseted = ser_test.reset_index(level = 'Country')
ser_test_reseted = df_test_reseted.set_index('Country', append = True).squeeze()
print(ser_test_reseted.sort_index(level = ['Country', 'Date']))
Date Country
2015-01-01 AR 10
2016-01-01 AR 20
2017-01-01 AR 30
2015-01-01 QA 100
2016-01-01 QA 200
2017-01-01 QA 300
2015-01-01 US 1
2016-01-01 US 2
2017-01-01 US 3
Is stack procedure really leads to ignoring lexicographical order or I've done something wrong?
It seems bug because sorting is by labels, not by levels.
If recreate MultiIndex
it working nice:
print (ser_test.index.labels)
[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]]
print (ser_test.index.levels)
[['2015-01-01', '2016-01-01', '2017-01-01'], ['US', 'AR', 'QA']]
ser_test.index = pd.MultiIndex.from_tuples(ser_test.index.tolist(), names=['Date', 'Country'])
print(ser_test.sort_index(level = ['Country', 'Date']))
Date Country
2015-01-01 AR 10
2016-01-01 AR 20
2017-01-01 AR 30
2015-01-01 QA 100
2016-01-01 QA 200
2017-01-01 QA 300
2015-01-01 US 1
2016-01-01 US 2
2017-01-01 US 3
dtype: int64
Another idea is use this trick - convert Series to one column DataFrame
and select by 0
in last step:
print(ser_test.to_frame().sort_index(level=['Country', 'Date'])[0])
Date Country
2015-01-01 AR 10
2016-01-01 AR 20
2017-01-01 AR 30
2015-01-01 QA 100
2016-01-01 QA 200
2017-01-01 QA 300
2015-01-01 US 1
2016-01-01 US 2
2017-01-01 US 3
Name: 0, dtype: int64