Q-Q plots are used to get the goodness of fit between a set of data points and theoretical distribution. Following is the procedure to get the points.
Find the model values that correspond to the samples. This is done in two steps,
a. Associate each sample with the percentile it represents. pi = (i-0.5)/n
b. Calculate the model value that would be associated with this percentile. This is done by inverting the model CDF, as is done when generating random variates from the model distribution. Thus the model value corresponding to sample i is Finverse(pi).
c. Draw the Q-Q plot, using the n points
( X(i), Finverse(pi)) 1 ≤ i ≤ n
Using this approach I came up with the following python implementation.
_distn_names = ["pareto"]
def fit_to_all_distributions(data):
dist_names = _distn_names
params = {}
for dist_name in dist_names:
try:
dist = getattr(st, dist_name)
param = dist.fit(data)
params[dist_name] = param
except Exception:
print("Error occurred in fitting")
params[dist_name] = "Error"
return params
def get_q_q_plot(values, dist, params):
values.sort()
arg = params[:-2]
loc = params[-2]
scale = params[-1]
x = []
for i in range(len(values)):
x.append((i-0.5)/len(values))
y = getattr(st, dist).ppf(x, loc=loc, scale=scale, *arg)
y = list(y)
emp_percentiles = values
dist_percentiles = y
print("Emperical Percentiles")
print(emp_percentiles)
print("Distribution Percentiles")
print(dist_percentiles)
plt.figure()
plt.xlabel('dist_percentiles')
plt.ylabel('actual_percentiles')
plt.title('Q Q plot')
plt.plot(dist_percentiles, emp_percentiles)
plt.savefig("/path/q-q-plot.png")
b = 2.62
latencies = st.pareto.rvs(b, size=500)
data = pd.Series(latencies)
params = fit_to_all_distributions(data)
pareto_params = params["pareto"]
get_q_q_plot(latencies, "pareto", pareto_params)
Ideally I should get a straight line, but this is what I get.
Why don't I get a straight line? Is there anything wrong in my implementation?
You can get the Q-Q plot for any distribution (there are 82 in scipy stats) using the following code.
import os
import matplotlib.pyplot as plt
import sys
import math
import numpy as np
import scipy.stats as st
from scipy.stats._continuous_distns import _distn_names
from scipy.optimize import curve_fit
def get_q_q_plot(latency_values, distribution):
distribution = getattr(st, distribution)
params = distribution.fit(latency_values)
latency_values.sort()
arg = params[:-2]
loc = params[-2]
scale = params[-1]
x = []
for i in range(1, len(latency_values)):
x.append((i-0.5) / len(latency_values))
y = distribution.ppf(x, loc=loc, scale=scale, *arg)
y = list(y)
emp_percentiles = latency_values[1:]
dist_percentiles = y
return emp_percentiles, dist_percentiles