I want to plot a Probability Density Function for the offsets from a log file. Here is the code:
timestamps = []
sequences = []
log_Name = 'test_rtt_25-01-17_13-07-41_values5_rate50.log'
log_Path = "/home/ubuntu/results-25-01-09-docker/"
true_Path = log_Path + log_Name
with open(true_Path, "r") as f:
for line in f:
if not line.startswith('#'):
time_part, seq_part = line.strip().split('(')
base, offset = time_part.split('+')
timestamps.append(float(offset))
seq = int(seq_part[:-1])
sequences.append(seq)
The code upon reads the data from the log file and then saves the offsets and sequences in 'timestamps' and 'sequences'.
Here is an example for 'timestamps' and 'sequences'.
[0.001009023, 0.001055868, 0.000992934, 0.001148472, 0.001086814, 0.001110649, 0.001066759, 0.00126167, 0.001231778, 0.000944345]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
As you can see there are 10 offsets and 10 sequences. Each offset has its number, for example: 0.001009023 is number 1. I want to plot a Probability Density Function, and i tried this:
source = {'seqs': sequences, 'times': timestamps}
df = pd.DataFrame(source)
df.sort_values(by = ['times'], inplace=True)
df_mean = np.mean(df['times'])
df_std = np.std(df['times'])
pdf = stats.norm.pdf(df['times'], df_mean, df_std)
plt.plot(df['times'], pdf)
plt.xlabel('Offsets') # Label for the x-axis
plt.savefig('/home/ubuntu/')
But the output y-axis is really strange, it looks like this:
Working complete code with input and imports
:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
timestamps = [0.001009023, 0.001055868, 0.000992934, 0.001148472, 0.001086814, 0.001110649, 0.001066759, 0.00126167, 0.001231778, 0.000944345]
sequences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
source = {'seqs': sequences, 'times': timestamps}
df = pd.DataFrame(source)
df.sort_values(by = ['times'], inplace=True)
df_mean = np.mean(df['times'])
df_std = np.std(df['times'])
pdf = stats.norm.pdf(df['times'], df_mean, df_std)
plt.plot(df['times'], pdf)
plt.xlabel('Offsets') # Label for the x-axis
plt.savefig('fig.png')
plt.show()
Output:
I have no idea why the probability is much bigger than 1, it should be smaller than 1. Anyone knows where i did wrong?
Here are three estimates of the PDF of the distribution underlying your data: maximum likelihood estimate (MLE, normal distribution), kernel density estimate (KDE), and Rosenblatt's shifted histogram (RSH).
The MLE here is only valid if you have reason to suspect that your data is normally distributed; however, you can use the same approach to fit other distributions to your data and plot their PDF. KDE is a continuous nonparametric estimate, and RSH is a discrete nonparametric estimate.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats, integrate
times = [0.001009023, 0.001055868, 0.000992934, 0.001148472, 0.001086814,
0.001110649, 0.001066759, 0.00126167, 0.001231778, 0.000944345]
times = np.asarray(times)
x = np.linspace(0.0008, 0.0014, 300)
# Maximum likelihood estimate normal distribution
mu, sigma = stats.norm.fit(times) # simply the mean and uncorrected variance
X = stats.Normal(mu=mu, sigma=sigma)
# Kernel density estimate
Y = stats.gaussian_kde(times)
# Rosenblatt's Shifted Histogram
z = stats.mstats.rsh(times, points=x)
plt.plot(x, X.pdf(x), label='MLE Normal Distribution')
plt.plot(x, Y.evaluate(x), label='KDE')
plt.plot(x, z, label='RSH')
plt.legend()
plt.title("PDF Estimates")
I have no idea why the probability is much bigger than 1,
The probability density function of a continuous distribution evaluated at x
is not the probability that a random variable will assume value x
( In a probability density function for a continuous random variable any single outcome has probability zero of occurring. Probability density is the probability per unit length, in other words, while the absolute likelihood for a continuous random variable to take on any particular value is 0 (since there is an infinite set of possible values to begin with SOURCE Wikipedia)), so it is not subject to the constraint you are thinking of (which is true of a probability mass function of a discrete distribution).
If the PDF is f(x)
, the probability that a random variable assumes a value between x
and x + h
is approximately h * f(x)
for sufficiently small h
. The relevant constraint here is that a valid PDF must be non-negative and integrate to 1 over the support. Indeed:
# provide limits of integration so the integrator can
# easily find the nonzero part of the function.
integrate.tanhsinh(X.pdf, 0, 0.15).integral
# 0.9999999999999987