Search code examples
python-3.xpandasdataframemulti-indexlevels

Getting the keys out of a Multiindex?


I've got a MultiIndex with IDs and Dates, of the form:

MultiIndex(levels=[[196003, 196005, 196007, 196009, 196012, 196103, 196105, 196107, 196109, 196112, 196203, 196205, 196207, 196209, 196212, 196303, 196305, 196307, 196309, 196312, 196403, 196405, 196407, 196409, 196412, 201705, 201707, 201709, 201712, 201803, 201805, 201807, 201809, 201812], ['1959-07-01', '1959-07-02', '1959-07-06', '1959-07-07', '1959-07-08', '1959-07-09', '1959-07-10', '1959-07-13', '1959-07-14', '1959-07-15', '1959-07-16', '1959-07-17', '1959-07-20', '1959-07-21', '1959-07-22', '1959-07-23', ...]])

Both ID & Date are required to specify a row uniquely.

What I want to do is extract the first level of the index.

When I do df.index[0], I get a tuple of the form (196003, '1959-07-01')

What I want is a Series of keys of the form [196003, 196005, ...] for level 0.

I managed to get it with:

list(df[~df['ID'].duplicated()]['ID'].sort_values().reset_index()['ID'])

but I perceive this to be a messy & slow solution.

What's the pandas-way?


Solution

  • I think you can use get_level_values with unique:

    import pandas as pd
    
    df = pd.DataFrame({'ID':[1,1,3],
                       'Dates':['2015-01-01','2015-01-01','2015-02-01'],
                       'C':[7,8,9]})
    df['Dates'] = pd.to_datetime(df.Dates)
    df.set_index(['ID', 'Dates'], inplace=True)
    print (df)
                   C
    ID Dates        
    1  2015-01-01  7
       2015-01-01  8
    3  2015-02-01  9
    
    print (df.index.get_level_values('ID').unique().tolist())
    [1, 3]
    
    #another a bit slowier solution
    print (df.index.get_level_values('ID').drop_duplicates().tolist())
    [1, 3]
    

    Timings:

    In [134]: %timeit (orig(df1))
    1000 loops, best of 3: 1.54 ms per loop
    
    In [138]: %timeit (df.index.get_level_values('ID').unique().tolist())
    10000 loops, best of 3: 131 µs per loop
    
    In [139]: %timeit (df.index.get_level_values('ID').drop_duplicates().tolist())
    10000 loops, best of 3: 182 µs per loop
    

    Code for timings:

    len(df) = 3k:

    import pandas as pd
    
    df = pd.DataFrame({'ID':[1,1,3],
                       'Dates':['2015-01-01','2015-01-01','2015-02-01'],
                       'C':[7,8,9]})
    df = pd.concat([df]*1000).reset_index(drop=True)
    df['Dates'] = pd.to_datetime(df.Dates)
    df.set_index(['ID', 'Dates'], inplace=True)
    print (df)
    
    
    df1 = df.copy()
    df1.reset_index('ID', inplace=True)
    
    def orig(df):
    
        return list(df[~df['ID'].duplicated()]['ID'].sort_values().reset_index()['ID'])
    
    print (df.index.get_level_values('ID').unique().tolist())
    
    print (orig(df1))
    
    print (df.index.get_level_values('ID').drop_duplicates().tolist())