
Difference between df.reindex() and df.set_index() methods in pandas


I was confused by this, which is very simple but I didn't immediately find the answer on StackOverflow:

  • df.set_index('xcol') makes the column 'xcol' become the index (when it is a column of df).

  • df.reindex(myList), however, takes indexes from outside the dataframe, for example, from a list named myList that we defined somewhere else.

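However, df.reindex(myList) aligns the existing rows to the new labels: any label in myList that is not already in df.index gets a row of NaN. If the goal is only to relabel the rows in place, a simple alternative is df.index = myList, which requires myList to have exactly one label per row.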
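To make the difference concrete, here is a minimal sketch. The dataframe and the list `my_list` are made up for illustration; they are not from any particular dataset.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # default index labels: 0 and 1
my_list = [1, 2]                               # hypothetical labels defined outside df

# reindex aligns on labels: label 1 already exists and keeps its row,
# label 2 does not exist, so its row is filled with NaN.
reindexed = df.reindex(my_list)

# Assigning to .index relabels the rows by position and leaves the values alone.
# The list needs exactly one label per row.
relabeled = df.copy()
relabeled.index = my_list

print(reindexed)
print(relabeled)
```

In this sketch, `reindexed` loses the row that was labelled 0, while `relabeled` keeps both rows and only changes their labels.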

I hope this post clarifies it! Additions to this post are also welcome!


Solution

  • You can see the difference in a simple example. Let's consider this dataframe:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    print(df)
       a  b
    0  1  3
    1  2  4
    

    The index labels are then 0 and 1.

    If you use set_index with the column 'a', the index labels become 1 and 2. If you do df.set_index('a').loc[1,'b'], you will get 3.

    Now if you use reindex with the same labels 1 and 2, as in df.reindex([1,2]), you will get 4.0 when you do df.reindex([1,2]).loc[1,'b'].

    What happened is that set_index replaced the previous index labels (0,1) with (1,2) (the values from column 'a') without touching the order of the values in column 'b':

    df.set_index('a')
       b
    a   
    1  3
    2  4
    

    while reindex changes the index but keeps the values in column 'b' associated with their labels in the original df. Label 1 keeps its original row (b = 4), and label 2, which did not exist before, gets NaN:

    df.reindex(df.a.values).drop(columns='a')  # equivalent to df.reindex([1, 2]).drop(columns='a')
         b
    1  4.0
    2  NaN
    # drop(columns='a') only hides column 'a', which does not matter in this example
    

    To summarize, reindex selects and reorders rows by their existing index labels without changing the values attached to each label (labels that do not exist yet get NaN), while set_index replaces the index with the values of a column, without touching the order of the other values in the dataframe.
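As a rough illustration of that summary, the sketch below reuses the same two-row dataframe. The variable names `reordered`, `by_a` and `looked_up` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# reindex with labels that already exist only reorders the rows.
reordered = df.reindex([1, 0])

# set_index promotes column 'a' to the index; the rows keep their order.
by_a = df.set_index('a')

# After set_index, reindex looks rows up by the new labels (the values of 'a').
# Label 3 is absent, so that row is filled with NaN.
looked_up = by_a.reindex([2, 1, 3])

print(reordered)
print(by_a)
print(looked_up)
```

One way to read it: set_index decides what the labels are, and reindex decides which labels appear and in what order.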