I am trying to select rows from a Pandas DataFrame, using the integer index values.
This does not work, and I obtain out of index errors.
reset_index()
to be called, although I may be mistakenimport pandas
data = {
'number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'fruit': 3 * ['apple'] + 3 * ['pear'] + 2 * ['banana'] + ['pear'] + ['apple'],
'color': 3 * ['red', 'green', 'blue'] + ['red'],
'letter': 5 * ['A', 'B'],
}
df = pandas.DataFrame(data)
df
df_selected = df[df['fruit'] == 'pear']
df_selected
df_selected.index
Index([3, 4, 5, 8], dtype='int64')
This certainly suggests I have a DataFrame with an Index containing the values 3, 4, 5 and 8.
I now want to select all the rows in the DataFrame from the first occurance of 'pear' to the last occurance, by using the integer index:
I thought this should be possible with the following syntax:
FIRST = 3
LAST = 8
df_selected[FIRST:LAST+1]
But I am mistaken:
stdout
or a Jupyter Notebook Cell) the index shows values 3, 4, 5, 8
.df_selected[A:B]
or df_selected.iloc[A:B]
the integer arguments A
and B
are interpreted as if df_selected.reset_index()
had been calledreset_index()
produces the following outputWhich implies the correct range to use when selecting by index is df_selected.iloc[0:3+1]
I am aware this is an incredibly basic question but I'm hoping someone can point me in the right direction as to understanding why the behaviour is this way, if there is a particular reason for it.
You should use loc
to select by label:
FIRST = 3
LAST = 8
df_selected = df[df['fruit'] == 'pear']
out = df.loc[FIRST:LAST]
Or:
idx = df[df['fruit'] == 'pear'].index
out = df.loc[idx.min():idx.max()]
NB. since loc
includes both ends, you do not need the +1
.
Output:
number fruit color letter
3 4 pear red B
4 5 pear green A
5 6 pear blue B
6 7 banana red A
7 8 banana green B
8 9 pear blue A
When using df_selected[1:2]
, this behaves like iloc
and selects from the fourth to the ninth position (so just the row with label 8
).
If your goal is to select all values between the first and last match, you could also use boolean indexing:
m = df['fruit'].eq('pear')
out = df[m.cummax()&m[::-1].cummax()]
How it works:
number fruit color letter m m.cummax() m[::-1].cummax() &
0 1 apple red A False False True False
1 2 apple green B False False True False
2 3 apple blue A False False True False
3 4 pear red B True True True True
4 5 pear green A True True True True
5 6 pear blue B True True True True
6 7 banana red A False True True True
7 8 banana green B False True True True
8 9 pear blue A True True True True
9 10 apple red B False True False False