I am trying to understand the intricacies of using loc on a DataFrame. Suppose we have the following:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df2 = df.loc[:,'a']
df2.loc[0] = 10
print(df)
print(df2)
    a  b
0  10  4
1   2  5
2   3  6
0    10
1     2
2     3
Name: a, dtype: int64
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df3 = df.loc[:,['a']]
df3.loc[0] = 10
print(df)
print(df3)
   a  b
0  1  4
1  2  5
2  3  6
    a
0  10
1   2
2   3
Why does the first piece of code modify the original dataframe, whereas the second does not?
Because in your first code, df2 is a view of df:
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df2 = df.loc[:,'a']
df2._is_view
# True
Use copy() to make sure you get an independent copy:
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df2 = df.loc[:,'a'].copy()
df2._is_view
# False
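Whether a selection comes back as a view can depend on the pandas version and on whether Copy-on-Write is enabled, so it is probably safer not to rely on it either way. A minimal sketch of the two explicit alternatives (the name col and the value 40 are only illustrative): take a copy when you want an independent object, and assign through df.loc itself when you do want to change the original.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Independent copy: changes to `col` do not reach `df`.
col = df.loc[:, 'a'].copy()
col.loc[0] = 10

# Intentional update of the original: assign through df.loc directly.
df.loc[0, 'b'] = 40

print(df)   # column 'a' is unchanged, 'b' has the new value in row 0
print(col)  # only the copy holds the 10
```

Written this way, the result does not depend on whether the intermediate selection happened to be a view.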
Because in the first case the slice is a Series (1D object) and in the second a DataFrame (2D):
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df.loc[:,'a'].shape
# (3,) -> this is 1D (Series)
df.loc[:,'a'].ndim
# 1
df.loc[:,['a']].shape
# (3,1) -> this is 2D (DataFrame)
df.loc[:,['a']].ndim
# 2