Search code examples
pandas

How to select rows from a Pandas DataFrame using index?


I am trying to select rows from a Pandas DataFrame, using the integer index values.

This does not work, and I obtain out of index errors.

  • This suggests to me that performing a selection of rows by index implicitly causes reset_index() to be called, although I may be mistaken
  • The following example explains why the behaviour I observe suggests this to be the case
import pandas

data = {
    'number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'fruit': 3 * ['apple'] + 3 * ['pear'] + 2 * ['banana'] + ['pear'] + ['apple'],
    'color': 3 * ['red', 'green', 'blue'] + ['red'],
    'letter': 5 * ['A', 'B'],
}
df = pandas.DataFrame(data)
df

Pandas DataFrame

df_selected = df[df['fruit'] == 'pear']
df_selected

Pandas DataFrame

df_selected.index
Index([3, 4, 5, 8], dtype='int64')

This certainly suggests I have a DataFrame with an Index containing the values 3, 4, 5 and 8.

I now want to select all the rows in the DataFrame from the first occurance of 'pear' to the last occurance, by using the integer index:

I thought this should be possible with the following syntax:

FIRST = 3
LAST = 8
df_selected[FIRST:LAST+1]

But I am mistaken:

Pandas DataFrame

  • When printing (displaying the DataFrame to stdout or a Jupyter Notebook Cell) the index shows values 3, 4, 5, 8.
  • When selecting by index using the syntax df_selected[A:B] or df_selected.iloc[A:B] the integer arguments A and B are interpreted as if df_selected.reset_index() had been called
  • I say this, because calling reset_index() produces the following output

Pandas DataFrame

Which implies the correct range to use when selecting by index is df_selected.iloc[0:3+1]

I am aware this is an incredibly basic question but I'm hoping someone can point me in the right direction as to understanding why the behaviour is this way, if there is a particular reason for it.


Solution

  • You should use loc to select by label:

    FIRST = 3
    LAST = 8
    
    df_selected = df[df['fruit'] == 'pear']
    out = df.loc[FIRST:LAST]
    

    Or:

    idx = df[df['fruit'] == 'pear'].index
    out = df.loc[idx.min():idx.max()]
    

    NB. since loc includes both ends, you do not need the +1.

    Output:

       number   fruit  color letter
    3       4    pear    red      B
    4       5    pear  green      A
    5       6    pear   blue      B
    6       7  banana    red      A
    7       8  banana  green      B
    8       9    pear   blue      A
    

    When using df_selected[1:2], this behaves like iloc and selects from the fourth to the ninth position (so just the row with label 8).

    alternative

    If your goal is to select all values between the first and last match, you could also use boolean indexing:

    m = df['fruit'].eq('pear')
    out = df[m.cummax()&m[::-1].cummax()]
    

    How it works:

       number   fruit  color letter      m  m.cummax()  m[::-1].cummax()      &
    0       1   apple    red      A  False       False              True  False
    1       2   apple  green      B  False       False              True  False
    2       3   apple   blue      A  False       False              True  False
    3       4    pear    red      B   True        True              True   True
    4       5    pear  green      A   True        True              True   True
    5       6    pear   blue      B   True        True              True   True
    6       7  banana    red      A  False        True              True   True
    7       8  banana  green      B  False        True              True   True
    8       9    pear   blue      A   True        True              True   True
    9      10   apple    red      B  False        True             False  False