Search code examples
pythonstatisticsprobability

How to get Q-Q plot for Pareto distribution in Python?


Q-Q plots are used to get the goodness of fit between a set of data points and theoretical distribution. Following is the procedure to get the points.

  1. Select the samples to use. Sort the selected samples with X(i) denoting the ith sample
  2. Find the model values that correspond to the samples. This is done in two steps,

    a. Associate each sample with the percentile it represents. pi = (i-0.5)/n

    b. Calculate the model value that would be associated with this percentile. This is done by inverting the model CDF, as is done when generating random variates from the model distribution. Thus the model value corresponding to sample i is Finverse(pi).

    c. Draw the Q-Q plot, using the n points

( X(i), Finverse(pi)) 1 ≤ i ≤ n

Using this approach I came up with the following python implementation.

_distn_names = ["pareto"]
def fit_to_all_distributions(data):
    dist_names = _distn_names

    params = {}
    for dist_name in dist_names:
        try:
            dist = getattr(st, dist_name)
            param = dist.fit(data)

            params[dist_name] = param
        except Exception:
            print("Error occurred in fitting")
            params[dist_name] = "Error"

    return params 

def get_q_q_plot(values, dist, params):
    values.sort()

    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    x = []

    for i in range(len(values)):
        x.append((i-0.5)/len(values))

    y = getattr(st, dist).ppf(x, loc=loc, scale=scale, *arg)

    y = list(y)

    emp_percentiles = values
    dist_percentiles = y

    print("Emperical Percentiles")
    print(emp_percentiles)

    print("Distribution Percentiles")
    print(dist_percentiles)

    plt.figure()
    plt.xlabel('dist_percentiles')
    plt.ylabel('actual_percentiles')
    plt.title('Q Q plot')
    plt.plot(dist_percentiles, emp_percentiles)
    plt.savefig("/path/q-q-plot.png")

b = 2.62
latencies = st.pareto.rvs(b, size=500)
data = pd.Series(latencies)
params = fit_to_all_distributions(data)

pareto_params = params["pareto"]

get_q_q_plot(latencies, "pareto", pareto_params)

Ideally I should get a straight line, but this is what I get.

QQ plot

Why don't I get a straight line? Is there anything wrong in my implementation?


Solution

  • You can get the Q-Q plot for any distribution (there are 82 in scipy stats) using the following code.

    import os
    import matplotlib.pyplot as plt
    import sys
    import math
    import numpy as np
    import scipy.stats as st
    from scipy.stats._continuous_distns import _distn_names
    from scipy.optimize import curve_fit
    
    def get_q_q_plot(latency_values, distribution):
    
        distribution = getattr(st, distribution)
        params = distribution.fit(latency_values)
    
        latency_values.sort()
    
        arg = params[:-2]
        loc = params[-2]
        scale = params[-1]
    
        x = []
    
        for i in range(1, len(latency_values)):
            x.append((i-0.5) / len(latency_values))
    
        y = distribution.ppf(x, loc=loc, scale=scale, *arg)
    
        y = list(y)
    
        emp_percentiles = latency_values[1:]
        dist_percentiles = y
    
        return emp_percentiles, dist_percentiles