python, numpy, pandas, scipy

How to z-score normalize a pandas column with NaNs?


I have a pandas DataFrame with a column of real values that I want to z-score normalize:

>> a
array([    nan,  0.0767,  0.4383,  0.7866,  0.8091,  0.1954,  0.6307,
        0.6599,  0.1065,  0.0508])
>> df = pandas.DataFrame({"a": a})

The problem is that a single nan value turns the entire result into nan:

>> from scipy.stats import zscore
>> zscore(df["a"])
array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])

What's the correct way to apply zscore (or an equivalent function not from scipy) to a column of a pandas DataFrame and have it ignore the nan values? I'd like the result to have the same dimension as the original column, with np.nan for values that can't be normalized.

Edit: maybe the best solution is to use scipy.stats.nanmean and scipy.stats.nanstd? I don't see why the degrees of freedom would need to be changed for std for this purpose:

zscore = lambda x: (x - scipy.stats.nanmean(x)) / scipy.stats.nanstd(x)
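
The same idea can be sketched with NumPy's NaN-aware reductions, which matters because, as far as I know, scipy.stats.nanmean and scipy.stats.nanstd are no longer available in current SciPy releases. The helper name nan_zscore is made up for this illustration. np.nanstd defaults to ddof=0, so no change to the degrees of freedom is needed here.

```python
import numpy as np

def nan_zscore(x):
    """Z-score that skips NaNs when computing mean and std (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # np.nanmean / np.nanstd ignore NaN entries; np.nanstd uses ddof=0 by default.
    return (x - np.nanmean(x)) / np.nanstd(x)

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
z = nan_zscore(a)
print(z)  # same length as a; the NaN position stays NaN
```

The output keeps the original shape because NaN minus a number is still NaN, so positions that can't be normalized are left as np.nan.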

Solution

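  • The pandas versions of mean and std skip NaN by default, so you can compute the z-score directly. pandas' std defaults to ddof=1 (sample standard deviation), whereas scipy's zscore defaults to ddof=0, so to get the same result as scipy's zscore I think you need to pass ddof=0 to std: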

    df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
    print(df)
    
            a    zscore
    0     NaN       NaN
    1  0.0767 -1.148329
    2  0.4383  0.071478
    3  0.7866  1.246419
    4  0.8091  1.322320
    5  0.1954 -0.747912
    6  0.6307  0.720512
    7  0.6599  0.819014
    8  0.1065 -1.047803
    9  0.0508 -1.235699
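
If this is needed for more than one column, the one-liner above can be wrapped in a small function. This is only a sketch: the name zscore_ignore_nan and the zscore_ddof1 column are invented for the example, and the ddof parameter is exposed so the effect of pandas' default (ddof=1) can be compared against ddof=0.

```python
import numpy as np
import pandas as pd

def zscore_ignore_nan(series, ddof=0):
    """Z-score a Series; NaNs are skipped in mean/std and stay NaN in the result.

    Hypothetical helper for illustration. ddof=0 is the population standard
    deviation; ddof=1 is what Series.std() uses when no argument is given.
    """
    return (series - series.mean()) / series.std(ddof=ddof)

df = pd.DataFrame({"a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
                         0.1954, 0.6307, 0.6599, 0.1065, 0.0508]})

df["zscore"] = zscore_ignore_nan(df["a"])                # ddof=0
df["zscore_ddof1"] = zscore_ignore_nan(df["a"], ddof=1)  # pandas' default std
print(df)
```

The two columns differ only by a constant factor, since the ddof=1 standard deviation is larger than the ddof=0 one by sqrt(n / (n - 1)), where n is the number of non-NaN values.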