I've seen many questions about the infamous SettingWithCopy
warning. I've even ventured to answer a few of them. Recently, I was putting together an answer that involved this topic and I wanted to present the benefit of a dataframe view. I failed to produce a tangible demonstration of why it is a good idea to create a dataframe view or whatever the thing is that generates SettingWithCopy
Consider df
df = pd.DataFrame([[1, 2], [3, 4]], list('ab'), list('AB'))
df
A B
x 1 2
y 3 4
and dfv
which is a copy of df
dfv = df[['A']]
print(dfv.is_copy)
<weakref at 0000000010916E08; to 'DataFrame' at 000000000EBF95C0>
print(bool(dfv.is_copy))
True
I can generate the SettingWithCopy
dfv.iloc[0, 0] = 0
However, dfv
has changed
print(dfv)
A
a 0
b 3
df
has not
print(df)
A B
x 1 2
y 3 4
and dfv
is still a copy
print(bool(dfv.is_copy))
True
If I change df
df.iloc[0, 0] = 7
print(df)
A B
x 7 2
y 3 4
But dfv
has not changed. However, I can reference df
from dfv
print(dfv.is_copy())
A B
x 7 2
y 3 4
If dfv
maintains it's own data (meaning, it doesn't actually save memory) and it assigns values via assignment operations despite warnings, then why do we bother saving references in the first place and generate SettingWithCopyWarning
at all?
What is the tangible benefit?
There is a lot of existing discussion on this, see here for instance, including the attempted PRs. It's also worth noting that true copy-on-write for views is being considered as part of the "pandas 2.0" refactor, see here.
The reason the reference is being maintained in your example is specifically because it's not a view, so that if someone tries this, they'll get a warning.
df[['A']].iloc[0, 0] = 1
Edit:
In terms of "why use views at all," it's for performance / memory reasons. Consider, basic indexing (selecting a column), because this operation takes a view, it is almost instantaneous.
df = pd.DataFrame(np.random.randn(1000000, 2), columns=['a','b'])
%timeit df['a']
100000 loops, best of 3: 2.13 µs per loop
Whereas taking a copy has a non-trivial cost.
%timeit df['a'].copy()
100 loops, best of 3: 4.28 ms per loop
This performance cost would show up in many operations, for instance adding two Series
together.
%timeit df['a'] + df['b']
100 loops, best of 3: 4.31 ms per loop
%timeit df['a'].copy() + df['b'].copy()
100 loops, best of 3: 13.3 ms per loop