Why doesn't a normal distribution plot as a straight line in stats.probplot()?


The problem is with the graph produced by scipy.stats.probplot(): samples from a normal distribution don't produce a straight line as expected.

I am trying to normalize some data using graphs as guidance.

However, after some strange results showing that zscore and log transformations had no effect, I started looking for what was wrong.

So I built a graph from synthetic values that should follow a normal distribution, and the resulting graph looks very odd.

Here are the steps to reproduce the array and the graph:

import math
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
norm = stats.norm.pdf(x, mu, sigma)

plt.plot(x, norm)
plt.show()
_ = stats.probplot(norm, plot=plt, sparams=(0, 1))
plt.show()

Distribution curve:

[Image: distribution curve]

Probability plot:

[Image: probability plot]


Solution

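  • Your synthesized data aren't normally distributed. numpy.linspace() gives you evenly spaced (uniformly distributed) x values, and stats.norm.pdf() returns the height of the density curve at each of them, not samples drawn from the distribution. You can visualize this by adding seaborn.distplot(y, fit=scipy.stats.norm).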

    import math
    
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import stats
    import seaborn as sns
    
    
    mu = 0
    variance = 1
    sigma = math.sqrt(variance)
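    # Evenly spaced grid of x values (not random draws)
    x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
    # Height of the normal density at each x; these are not samples
    y = stats.norm.pdf(x, mu, sigma)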
    
    sns.distplot(y, fit=stats.norm)
    fig = plt.figure()
    res = stats.probplot(y, plot=plt, sparams=(0, 1))
    plt.show()
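
The same point can be checked numerically, without any plotting. This is only a sketch: the grid and parameters are copied from the question, and the variable names are just for illustration. When called without plot=..., stats.probplot() returns the quantile pairs and a least-squares fit whose r value measures how close the points are to a straight line.

```python
import numpy as np
from scipy import stats

x = np.linspace(-3, 3, 100)     # evenly spaced grid, not random draws
y = stats.norm.pdf(x, 0, 1)     # density heights at each grid point

# Density heights are strictly positive and capped by the peak of the
# curve, 1/sqrt(2*pi), so they cannot look like draws from N(0, 1).
print("min:", y.min(), "max:", y.max())

# (osm, osr) are the theoretical and ordered sample quantiles;
# r is the correlation of the fitted line through them.
(osm, osr), (slope, intercept, r) = stats.probplot(y, dist="norm")
print("slope:", slope, "intercept:", intercept, "r:", r)
```

An r value noticeably below 1 here matches the bent curve in the probability plot.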
    

    Try synthesizing your data with numpy.random.normal(). This will give you normally distributed data.

    import math
    
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import stats
    import seaborn as sns
    
    
    mu = 0
    variance = 1
    sigma = math.sqrt(variance)
    # Draw 100 random samples from a normal distribution
    x = np.random.normal(loc=mu, scale=sigma, size=100)
    
    sns.distplot(x, fit=stats.norm)
    fig = plt.figure()
    res = stats.probplot(x, plot=plt, sparams=(0, 1))
    plt.show()
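
If you want to confirm the result without looking at a plot, something like the following should work. It is a minimal sketch under a few assumptions: it uses numpy.random.default_rng() with an arbitrary seed instead of numpy.random.normal() so the run is repeatable, and a larger sample size (1000) than above so the fit is stable.

```python
import numpy as np
from scipy import stats

mu, sigma = 0, 1
rng = np.random.default_rng(0)                 # seed chosen arbitrarily
samples = rng.normal(loc=mu, scale=sigma, size=1000)

# For genuinely normal samples the probability plot is close to a line:
# the slope estimates sigma, the intercept estimates mu, and r is near 1.
(osm, osr), (slope, intercept, r) = stats.probplot(samples, dist="norm")
print("slope:", slope, "intercept:", intercept, "r:", r)
```

Comparing this r with the one you get from the pdf values is a quick way to tell real samples from a curve evaluated on a grid.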