python pandas missing-data

How can we make pandas' default handling of missing values warn of their presence rather than silently ignore them?


As discussed here, pandas silently skips NaN values when calculating sums, so an all-NaN column sums to 0. Explicit calculations, by contrast, propagate NaN:

import pandas as pd
import numpy  as np

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
np.nan + np.nan                              # Result: nan
pd.DataFrame([np.nan, np.nan]).sum().item()  # Result: 0.0

pandas' Descriptive Statistics methods have a skipna argument. However, skipna defaults to True, which masks the presence of missing values from casual users and novice programmers.

This creates a risk that analyses will be "...quietly, accidentally wrong since their Pandas operators haven't used the correct skipna".

In Python, is there a way for users to set skipna=False as the default option?


Solution

  • The documentation states this behavior directly:

    skipna (bool, default True) - Exclude NA/null values when computing the result.

    The skipna parameter of the pd.DataFrame.sum() method defaults to True. So when you take a column sum, it skips the NaN values, and a column containing only NaN sums to 0.

    If you set it to False, you get the intended behavior. However, pandas has no option for making False the default. You have to pass skipna=False on each call, unless you define your own wrapper around the method.

    import pandas as pd
    import numpy  as np
    
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    np.nan + np.nan
    pd.DataFrame([np.nan, np.nan]).sum(skipna=False)
    
    0   NaN
    dtype: float64
    

    Here is a wrapper that changes a method's keyword defaults globally. The code is based on this excellent SO answer.

    ## Function adapted from - 
    ## https://stackoverflow.com/questions/55877832/setting-pandas-global-default-for-skipna-to-false
    
    import functools
    
    def set_default(func, **default):
        @functools.wraps(func)                # Keep the original name and docstring
        def inner(*args, **kwargs):
            merged = {**default, **kwargs}    # New defaults first, so explicit kwargs still win
            return func(*args, **merged)      # Call function w/ merged kwargs
        return inner                          # Return decorated function
    
    # Patching the class affects every DataFrame in the running process
    pd.DataFrame.sum = set_default(pd.DataFrame.sum, skipna=False)
    pd.DataFrame([np.nan, np.nan]).sum()
    
    0   NaN
    dtype: float64
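
The question title asks for a warning rather than a changed result. Here is a minimal sketch of that idea, using the same monkey-patching approach as the wrapper above. The helper name warn_on_na and the choice of sum and mean are hypothetical and not part of pandas. It only notices skipna when it is passed as a keyword, so treat it as a starting point rather than a complete solution.

```python
import functools
import warnings

import numpy as np
import pandas as pd


def warn_on_na(func):
    """Hypothetical helper: warn when a reduction is about to skip NaN values."""
    @functools.wraps(func)
    def inner(self, *args, **kwargs):
        # skipna defaults to True in pandas; a positional skipna is not detected here.
        if kwargs.get("skipna", True) and self.isna().to_numpy().any():
            warnings.warn(
                f"{func.__name__}: missing values are present and will be skipped "
                "(skipna=True)",
                UserWarning,
                stacklevel=2,
            )
        return func(self, *args, **kwargs)
    return inner


# Patch an illustrative selection of reductions on DataFrame and Series.
for cls in (pd.DataFrame, pd.Series):
    for name in ("sum", "mean"):
        setattr(cls, name, warn_on_na(getattr(cls, name)))

df = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, 3.0]})

# Capture warnings so the demonstration can show that one was raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    total = df.sum()               # NaN in column "a" is skipped, with a warning

print(len(caught) > 0)             # True: at least one warning was emitted
strict = df.sum(skipna=False)      # explicit choice, so the helper stays silent
```

Unlike changing the default to skipna=False, this keeps existing results unchanged and only makes the presence of missing values visible. Running Python with -W error::UserWarning would turn the warning into an exception if a hard failure is preferred.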