Using scipy to fit a CDF to real data, but the CDF does not start from 0

Here are my samples and my code for fitting the CDF.

import numpy as np
import pandas as pd
import scipy.stats as st

samples = [2,3,10,7,9,6,1,3,7,2,5,4,6,3,4,1,4,6,3,10,3,7,5,6,6,5,4,2,2,5,4,5,6,4,4,6,3,3,3,2,2,2,4,2,6,2,7,4,3,2,2,1,4,2,2,5,3,9,6,8,3,6,6,3,9,2,3,3,3,5,4,4,5,4,1,8,5,8,6,6,7,6,3,2,4,2,16,6,2,3,4,2,2,9,9,5,5,5,1,5,2,8,5,3,5,8,11,4,7,4,11,3,7,3,6,6,1,4,2,1,1,1,9,4,15,2,1,3,4,9,3,3,4,3,6,3,3,5,5,6,3,3,4,8,4,4,2,5,6,7,3,5,5,2,5,9,7,6,1,3,4,9,3,2,4,8,5,8,4,4,5,6,5,8,6,1,3,7,9,6,7,12,4,1,4,5,5,7,1,7,1,15,3,3,2,3,7,7,15,6,5,1,7,4,2,10,1,3,3,8,3,8,1,5,4,7,4,2,9,2,1,3,6,1,6,10,6,3,4,7,5,7,3,3,7,4,4,3,5,3,5,2,2,1,2,3,1,1,2,1,1,2,3,10,7,3,2,6,5,6,5,11,1,7,5,2,9,5,12,6,3,9,9,4,3,4,6,4,10,4,8,6,1,7,2,5,8,3,1,3,1,1,3,3,2,2,6,3,3,2,6,6,6,4,2,4,1,10,5,3,5,6,3,4,1,1,7,6,6,5,7,6,3,4,6,6,5,3,2,3,2,1,2,4,1,1,1,3,7,1,6,3,4,3,3,6,7,3,7,4,1,1,7,1,4,4,3,4,2,4,2,6,6,2,2,6,5,4,6,5,6,3,5,1,5,3,3,2,2,2,2,3,3,3,2,2,1,4,2,3,5,7,2,5,1,2,2,5,6,5,2,1,2,4,5,2,3,2,4,9,3,5,2,2,5,4,2,3,4,2,3,1,3,6,7,2,6,3,5,4,2,2,2,2,1,2,5,2,2,3,4,2,5,2,2,3,5,3,2,4,3,2,5,4,1,4,8,6,8,2,2,3,1,2,3,8,2,3,4,3,3,2,1,1,1,3,3,4,3,4,1,2,8,2,2,7,3,1,2,3,3,2,3,1,2,1,1,1,3,2,2,2,4,7,2,1,2,3,1,3,1,1,6,2,1,1,3,1,4,4,1,3,1,1,4,1,1,2,4,4,3,2,3,2,1,2,1,4,2,5,3,4,2,1,1,1,3,1,2,1,1,4,2,1,3,2,1,3,2,1,1,1,2,1,1,1,1,2,1,1,1,1,1,1,1]

# min(samples) is 1, so the bins start from 1.
bins = np.arange(1, 18, 0.1)
y, x = np.histogram(samples, bins=bins, density=True)

params = st.lognorm.fit(samples)
# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]

ccdf = st.lognorm.cdf(x, *arg, loc=loc, scale=scale)
cdf = pd.Series(ccdf, x)

#cdf[1.0] is not 0... That is the issue...

When I print the first value, cdf[1.0], it is not equal to 0. According to theory, it should be 0. As the picture below shows, the first CDF value is not 0. I have checked my code again and again, but I cannot fix the problem. I would appreciate any suggestions.

Solution

In your code, you compute a histogram of your sample. That is fine, but the graph you show is not a histogram: it is a distribution function (cdf) of the sample. The code does not match the picture.
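To make the difference between the two objects concrete, here is a minimal sketch of how a density histogram relates to a distribution function. The short `samples` list is a hypothetical stand-in for the full sample in the question, and `density`, `edges` and `empirical_cdf` are names chosen for this illustration only.

```python
import numpy as np

# Hypothetical short sample standing in for the full list in the question.
samples = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 9, 12, 16]

bins = np.arange(1, 18, 0.1)
# density=True scales the bar heights so that the total bar area is 1.
density, edges = np.histogram(samples, bins=bins, density=True)

# Accumulating bar area (height * width) turns the histogram into an
# empirical distribution function evaluated at the right edge of each bin.
empirical_cdf = np.cumsum(density * np.diff(edges))

print("bars:", len(density), "edges:", len(edges))
print("value after the first bin:", empirical_cdf[0])
print("value after the last bin:", empirical_cdf[-1])
```

Note that `np.histogram` returns one more edge than it returns bars, and that the accumulated curve is already positive after the first bin, because that bin contains the smallest observations.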

Here is the pdf graph and histogram.

Code for graph above:

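# ... insert your sample and calculate lognorm parameters (already in your code)
import matplotlib.pyplot as plt

# Evaluate the fitted pdf on a grid spanning the observed range.
x = np.linspace(min(samples), max(samples), 100)
pdf = st.lognorm.pdf(x, *arg, loc=loc, scale=scale)
plt.plot(x, pdf)
# One bin per unit of the observed range; density=True puts the bars on the pdf scale.
plt.hist(samples, bins=max(samples)-min(samples), density=True, alpha=0.75)
plt.show()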

Your code also computes the cdf values, and scipy evaluates them correctly. The graph you drew is exactly that cdf.

What you are missing is that the cdf at the minimum value in the sample does not have to be zero.

Be aware that the fit function only picks the curve from the chosen family that lies closest to your sample. It does not produce a curve that exactly reproduces the empirical distribution function.
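The gap between a fitted curve and the empirical distribution function can be measured directly. The sketch below uses synthetic data drawn with `st.lognorm.rvs` (an assumption made so the example is self-contained; it is not the sample from the question), and the names `ecdf`, `fitted_cdf` and `max_gap` are illustrative.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in data (an assumption, not the sample from the question).
samples = st.lognorm.rvs(0.6, loc=0.0, scale=3.0, size=500, random_state=0)

# Maximum-likelihood fit; for lognorm this returns (shape, loc, scale).
shape, loc, scale = st.lognorm.fit(samples)

# Empirical distribution function at the sorted observations:
# the i-th smallest value has i/n of the sample at or below it.
xs = np.sort(samples)
n = len(xs)
ecdf = np.arange(1, n + 1) / n

# Fitted cdf evaluated at the same points.
fitted_cdf = st.lognorm.cdf(xs, shape, loc=loc, scale=scale)
max_gap = np.max(np.abs(ecdf - fitted_cdf))

print("ECDF at the smallest observation:", ecdf[0])
print("fitted cdf at the smallest observation:", fitted_cdf[0])
print("largest gap between the two curves:", max_gap)
```

Even the empirical distribution function is not zero at the smallest observation: with the usual definition it equals 1/n there (more if the minimum is repeated), since that observation itself counts as "less than or equal to".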

The fitted distribution simply allows values less than one, although there are no such values in the sample it was fitted to. In the same way, the pdf says that a value greater than 14 is extremely unlikely, yet your sample contains values as large as 15 and 16. As a result, the cdf does not have to be zero at your point cdf[1.0].
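One way to see this numerically is to ask the fitted distribution how much probability it places outside the observed range. This is a rough sketch on synthetic data (again an assumption, not the sample from the question); `below_min` and `above_max` are names made up for the illustration.

```python
import scipy.stats as st

# Synthetic stand-in data (an assumption, not the sample from the question).
samples = st.lognorm.rvs(0.6, loc=0.0, scale=3.0, size=500, random_state=0)

shape, loc, scale = st.lognorm.fit(samples)
# "Freeze" the fitted parameters so they need not be repeated in every call.
fitted = st.lognorm(shape, loc=loc, scale=scale)

# Probability the model assigns at or below the smallest observation.
below_min = fitted.cdf(samples.min())
# Probability the model assigns above the largest observation (sf = 1 - cdf).
above_max = fitted.sf(samples.max())

print("P(X <= min(samples)) under the fit:", below_min)
print("P(X >  max(samples)) under the fit:", above_max)
```

Both numbers are probabilities under the model, not statements about the data, so there is no reason for either of them to be exactly zero.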

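P.S. The cdf does reach zero if you evaluate it far enough to the left. A lognormal cdf is zero for every x at or below loc, so it will be equal to zero at x = 0 if you pass that point to it, provided the fitted loc is not negative.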

Code for graph above:

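# ... insert your sample and calculate lognorm parameters (already in your code)
import matplotlib.pyplot as plt

# Start the grid at 0, below min(samples), to see where the cdf begins to rise.
x = np.linspace(0, max(samples), 100)
cdf = st.lognorm.cdf(x, *arg, loc=loc, scale=scale)
plt.plot(x, cdf)
plt.show()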
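Where exactly the cdf is zero depends on the loc parameter, which can be checked without any plotting. The parameter values below are hypothetical, chosen only for illustration; they are not the values fitted from the sample in the question.

```python
import scipy.stats as st

# Hypothetical parameters for illustration (not the fit from the question).
shape, loc, scale = 0.7, 0.5, 3.0
dist = st.lognorm(shape, loc=loc, scale=scale)

# The support of a shifted lognormal is x > loc, so the cdf is zero up to loc.
print(dist.cdf(loc))        # 0.0
print(dist.cdf(0.0))        # 0.0, because 0 lies below loc = 0.5
print(dist.cdf(1.0) > 0.0)  # True: some mass already lies between loc and 1

# With a negative loc the support extends below zero, so the cdf at 0 is positive.
shifted = st.lognorm(shape, loc=-0.5, scale=scale)
print(shifted.cdf(0.0) > 0.0)  # True
```

So the curve always starts from zero somewhere, but that point is loc, not the smallest value in the sample.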