How to get Q-Q plot for Pareto distribution in Python?

Q-Q plots are used to get the goodness of fit between a set of data points and theoretical distribution. Following is the procedure to get the points.

Select the samples to use. Sort the selected samples with X(i) denoting the ith sample
Find the model values that correspond to the samples. This is done in two steps,

a. Associate each sample with the percentile it represents. pi = (i-0.5)/n

b. Calculate the model value that would be associated with this percentile. This is done by inverting the model CDF, as is done when generating random variates from the model distribution. Thus the model value corresponding to sample i is Finverse(pi).

c. Draw the Q-Q plot, using the n points

( X(i), Finverse(pi)) 1 ≤ i ≤ n

Using this approach I came up with the following python implementation.

_distn_names = ["pareto"]
def fit_to_all_distributions(data):
    dist_names = _distn_names

    params = {}
    for dist_name in dist_names:
        try:
            dist = getattr(st, dist_name)
            param = dist.fit(data)

            params[dist_name] = param
        except Exception:
            print("Error occurred in fitting")
            params[dist_name] = "Error"

    return params 

def get_q_q_plot(values, dist, params):
    values.sort()

    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    x = []

    for i in range(len(values)):
        x.append((i-0.5)/len(values))

    y = getattr(st, dist).ppf(x, loc=loc, scale=scale, *arg)

    y = list(y)

    emp_percentiles = values
    dist_percentiles = y

    print("Emperical Percentiles")
    print(emp_percentiles)

    print("Distribution Percentiles")
    print(dist_percentiles)

    plt.figure()
    plt.xlabel('dist_percentiles')
    plt.ylabel('actual_percentiles')
    plt.title('Q Q plot')
    plt.plot(dist_percentiles, emp_percentiles)
    plt.savefig("/path/q-q-plot.png")

b = 2.62
latencies = st.pareto.rvs(b, size=500)
data = pd.Series(latencies)
params = fit_to_all_distributions(data)

pareto_params = params["pareto"]

get_q_q_plot(latencies, "pareto", pareto_params)

Ideally I should get a straight line, but this is what I get.

Why don't I get a straight line? Is there anything wrong in my implementation?

Solution

You can get the Q-Q plot for any distribution (there are 82 in scipy stats) using the following code.

import os
import matplotlib.pyplot as plt
import sys
import math
import numpy as np
import scipy.stats as st
from scipy.stats._continuous_distns import _distn_names
from scipy.optimize import curve_fit

def get_q_q_plot(latency_values, distribution):

    distribution = getattr(st, distribution)
    params = distribution.fit(latency_values)

    latency_values.sort()

    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    x = []

    for i in range(1, len(latency_values)):
        x.append((i-0.5) / len(latency_values))

    y = distribution.ppf(x, loc=loc, scale=scale, *arg)

    y = list(y)

    emp_percentiles = latency_values[1:]
    dist_percentiles = y

    return emp_percentiles, dist_percentiles