Tags: python, pandas, normalization, valueerror, normal-distribution

Log transformation - ValueError: cannot convert float NaN to integer


The data in some columns doesn't follow a normal distribution, so I wanted to normalize those columns with a log transformation.

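import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,6))
#1 Distribution before the log transformation
sns.distplot(train_df['MasVnrArea'], fit=stats.norm, ax=ax[0])
ax[0].set_title('Before Normalization')

#2 Distribution after the log transformation
train_df['MasVnrArea'] = np.log(train_df['MasVnrArea'])
ax[1].set_title('After Normalization')
sns.distplot(train_df['MasVnrArea'], fit=stats.norm, ax=ax[1])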

Part #1 works fine, but when it comes to part #2 it gives me this error:

ValueError: cannot convert float NaN to integer

I already checked whether this column contains any NaN values, and it doesn't. So what is the problem?


Solution

  • When did you check for NaN values? If it was before the log transformation, the check missed whatever np.log produced.

    Did you check whether train_df['MasVnrArea'] contains values less than or equal to 0? np.log returns -inf for 0 and NaN for negative values, and either can make the plotting call on the next line raise the error.

    • Check again for NaN and infinite values after the log calculation.
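
The two checks can be sketched roughly as follows. This is only an illustration: the `area` Series is a made-up stand-in for train_df['MasVnrArea'], whose real contents are not shown in the question. Note that pandas' isna() does not flag -inf by default, which may be why an NaN check alone finds nothing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df['MasVnrArea']; the real data is not shown.
area = pd.Series([0.0, 196.0, 0.0, 162.0, 350.0], name="MasVnrArea")

# Values for which the natural log is not finite (zero or negative).
n_non_positive = int((area <= 0).sum())

# Apply the log as in the question, silencing the expected RuntimeWarnings.
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(area)

# isna() misses -inf, so test for finiteness explicitly as well.
n_nan = int(logged.isna().sum())
n_non_finite = int((~np.isfinite(logged)).sum())

print(n_non_positive, n_nan, n_non_finite)
```

If `n_non_finite` is greater than zero while `n_nan` is zero, the column contains zeros that np.log turned into -inf.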

    Example from "Using numpy.log() on 0":

    import numpy as np 
    print(np.log(0))
    

    Output:

    -inf 
    /usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log
    

    Explanation:

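    The logarithm of zero is not defined. It is not a real number, because raising a positive base to any real power never gives zero. NumPy returns -inf, which is the limit of log(x) as x approaches 0 from above, and emits a RuntimeWarning instead of raising an exception.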
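
If the column does contain zeros, one common workaround (a suggestion here, not something the question's code does) is np.log1p, which computes log(1 + x). It is a different transformation from np.log, and it still needs every value to be greater than -1. A minimal sketch with made-up sample values:

```python
import numpy as np

# Hypothetical sample values standing in for the column; zeros included on purpose.
values = np.array([0.0, 196.0, 0.0, 162.0, 350.0])

# log1p(x) computes log(1 + x), so 0 maps to 0 instead of -inf.
transformed = np.log1p(values)

# expm1 is the inverse of log1p, useful for mapping back to the original scale.
restored = np.expm1(transformed)

print(np.isfinite(transformed).all())
print(np.allclose(restored, values))
```

In the question's code this would mean replacing np.log(train_df['MasVnrArea']) with np.log1p(train_df['MasVnrArea']), assuming the column has no NaN or negative values.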