Why do we use loc for pandas dataframes?
It seems the following code, with or without loc, both runs correctly and runs at a similar speed:
%timeit df_user1 = df.loc[df.user_id=='5561']
100 loops, best of 3: 11.9 ms per loop
or
%timeit df_user1_noloc = df[df.user_id=='5561']
100 loops, best of 3: 12 ms per loop
So why use loc?
Edit: This has been flagged as a duplicate question. Although "pandas iloc vs ix vs loc explanation?" does mention that you can do column retrieval just by using the data frame's __getitem__:
df['time'] # equivalent to df.loc[:, 'time']
it does not say why we use loc, although it does explain lots of features of loc. My specific question is: why not just omit loc altogether? For this question, I have accepted a very detailed answer below.
Also, in the above post, the answer (which I do not think is an answer) is well hidden in the discussion. Anyone searching for what I was would find it hard to locate the information and would be much better served by the answer provided to my question here.
Explicit is better than implicit.
df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:
In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]:
   False  True
0      3     1
1      4     2
2      5     3
You might want to use df[[True]] to select the True column. Instead it raises a ValueError:
In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
Versus using loc:
In [231]: df.loc[[True]]
Out[231]:
   False  True
0      3     1
In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as that of df above:
In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]:
   A  B
0  1  3
1  2  4
2  3  5
In [259]: df2[['B']]
Out[259]:
   B
0  3
1  4
2  5
Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.
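As a rough illustration of the point above (a minimal sketch, assuming a reasonably recent pandas; the variable names mask, rows_plain, rows_loc and cols are made up for this example), the same square-bracket syntax selects rows or columns depending on what is passed in, while df.loc with a single indexer always refers to rows:

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

# A boolean Series aligned with df2's index (hypothetical example mask).
mask = df2['A'] > 1

# Plain brackets: pandas infers "select rows" from the boolean mask.
rows_plain = df2[mask]

# .loc: row selection is stated explicitly.
rows_loc = df2.loc[mask]

# For an ordinary boolean mask, both spellings return the same rows.
same = rows_plain.equals(rows_loc)

# The same bracket syntax with a list of labels selects columns instead.
cols = df2[['B']]

print(same)
print(rows_loc)
print(cols)
```

Reading df2[mask] and df2[['B']] side by side, you have to know what the indexer contains to tell which axis is being selected; with df2.loc[mask] the axis is clear from the syntax alone.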
df.loc[row_indexer, column_indexer] can select rows and columns in one step. df[indexer] can only select rows or columns, depending on the type of values in indexer and the type of column labels df has (again, are they boolean?).
In [237]: df2.loc[[True,False,True], 'B']
Out[237]:
0    3
2    5
Name: B, dtype: int64
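To make the comparison concrete, here is a small sketch (the names row_mask, one_step and two_step are only for illustration) that does the same selection with a single df.loc call and with two chained bracket lookups:

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})
row_mask = [True, False, True]  # keep the first and last rows

# One step: rows and column chosen together.
one_step = df2.loc[row_mask, 'B']

# Two steps with plain brackets: first rows, then the column.
two_step = df2[row_mask]['B']

print(one_step)
print(one_step.equals(two_step))
```

Both give the same Series here, but the bracket version needs two separate lookups, whereas df.loc expresses the row and column choice in one call.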
When a slice is passed to df.loc, the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:
In [239]: df2.loc[1:2]
Out[239]:
   A  B
1  2  4
2  3  5
In [271]: df2[1:2]
Out[271]:
   A  B
1  2  4
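The slicing difference can be checked directly. This is only a sketch; the string-indexed frame called labeled is not part of the example above and is added here just to show that df.loc slices by label:

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

# .loc slices by label and includes both end-points: rows 1 and 2.
inclusive = df2.loc[1:2]

# Plain brackets slice by position, half-open like a Python list: row 1 only.
half_open = df2[1:2]

print(len(inclusive), len(half_open))

# Hypothetical frame with string labels, to show that .loc works on labels.
labeled = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]}, index=['a', 'b', 'c'])
print(labeled.loc['a':'b'])  # includes the row labeled 'b'
```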