Search code examples
pythonpandaschained-assignment

What is the benefit of a dataframe view or copy


I've seen many questions about the infamous SettingWithCopy warning. I've even ventured to answer a few of them. Recently, I was putting together an answer that involved this topic and I wanted to present the benefit of a dataframe view. I failed to produce a tangible demonstration of why it is a good idea to create a dataframe view or whatever the thing is that generates SettingWithCopy

Consider df

df = pd.DataFrame([[1, 2], [3, 4]], list('ab'), list('AB'))
df

   A  B
x  1  2
y  3  4

and dfv which is a copy of df

dfv = df[['A']]

print(dfv.is_copy)

<weakref at 0000000010916E08; to 'DataFrame' at 000000000EBF95C0>

print(bool(dfv.is_copy))

True

I can generate the SettingWithCopy

dfv.iloc[0, 0] = 0

enter image description here


However, dfv has changed

print(dfv)

   A
a  0
b  3

df has not

print(df)

   A  B
x  1  2
y  3  4

and dfv is still a copy

print(bool(dfv.is_copy))

True

If I change df

df.iloc[0, 0] = 7
print(df)

   A  B
x  7  2
y  3  4

But dfv has not changed. However, I can reference df from dfv

print(dfv.is_copy())

   A  B
x  7  2
y  3  4

Question

If dfv maintains it's own data (meaning, it doesn't actually save memory) and it assigns values via assignment operations despite warnings, then why do we bother saving references in the first place and generate SettingWithCopyWarning at all?

What is the tangible benefit?


Solution

  • There is a lot of existing discussion on this, see here for instance, including the attempted PRs. It's also worth noting that true copy-on-write for views is being considered as part of the "pandas 2.0" refactor, see here.

    The reason the reference is being maintained in your example is specifically because it's not a view, so that if someone tries this, they'll get a warning.

    df[['A']].iloc[0, 0] = 1
    

    Edit:

    In terms of "why use views at all," it's for performance / memory reasons. Consider, basic indexing (selecting a column), because this operation takes a view, it is almost instantaneous.

    df = pd.DataFrame(np.random.randn(1000000, 2), columns=['a','b'])
    
    %timeit df['a']
    100000 loops, best of 3: 2.13 µs per loop
    

    Whereas taking a copy has a non-trivial cost.

    %timeit df['a'].copy()
    100 loops, best of 3: 4.28 ms per loop
    

    This performance cost would show up in many operations, for instance adding two Series together.

    %timeit df['a'] + df['b']
    100 loops, best of 3: 4.31 ms per loop
    
    %timeit df['a'].copy() + df['b'].copy()
    100 loops, best of 3: 13.3 ms per loop