Search code examples
pythondataframenumpymatplotlib

How to plot a Probability Density Function in Python?


I want to plot a Probability Density Function for the offsets from a log file. Here is the code:

timestamps = []
sequences = []

log_Name = 'test_rtt_25-01-17_13-07-41_values5_rate50.log'
log_Path = "/home/ubuntu/results-25-01-09-docker/"
true_Path = log_Path + log_Name
with open(true_Path, "r") as f:
    for line in f:
        if not line.startswith('#'):
            time_part, seq_part = line.strip().split('(')
            base, offset = time_part.split('+')

            timestamps.append(float(offset))

            seq = int(seq_part[:-1])
            sequences.append(seq)

The code upon reads the data from the log file and then saves the offsets and sequences in 'timestamps' and 'sequences'.

Here is an example for 'timestamps' and 'sequences'.

[0.001009023, 0.001055868, 0.000992934, 0.001148472, 0.001086814, 0.001110649, 0.001066759, 0.00126167, 0.001231778, 0.000944345]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

As you can see there are 10 offsets and 10 sequences. Each offset has its number, for example: 0.001009023 is number 1. I want to plot a Probability Density Function, and i tried this:

source = {'seqs': sequences, 'times': timestamps}
df = pd.DataFrame(source)
df.sort_values(by = ['times'], inplace=True)
df_mean = np.mean(df['times'])
df_std = np.std(df['times'])
pdf = stats.norm.pdf(df['times'], df_mean, df_std)
plt.plot(df['times'], pdf)
plt.xlabel('Offsets')  # Label for the x-axis
plt.savefig('/home/ubuntu/')

But the output y-axis is really strange, it looks like this: enter image description here

Working complete code with input and imports:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import scipy.stats as stats

timestamps = [0.001009023, 0.001055868, 0.000992934, 0.001148472, 0.001086814, 0.001110649, 0.001066759, 0.00126167, 0.001231778, 0.000944345]

sequences = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

source = {'seqs': sequences, 'times': timestamps}

df = pd.DataFrame(source)

df.sort_values(by = ['times'], inplace=True)

df_mean = np.mean(df['times'])

df_std = np.std(df['times'])

pdf = stats.norm.pdf(df['times'], df_mean, df_std)

plt.plot(df['times'], pdf)

plt.xlabel('Offsets')  # Label for the x-axis

plt.savefig('fig.png')

plt.show()

Output:

enter image description here

I have no idea why the probability is much bigger than 1, it should be smaller than 1. Anyone knows where i did wrong?


Solution

  • Here are three estimates of the PDF of the distribution underlying your data: maximum likelihood estimate (MLE, normal distribution), kernel density estimate (KDE), and Rosenblatt's shifted histogram (RSH).

    The MLE here is only valid if you have reason to suspect that your data is normally distributed; however, you can use the same approach to fit other distributions to your data and plot their PDF. KDE is a continuous nonparametric estimate, and RSH is a discrete nonparametric estimate.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats, integrate
    
    times = [0.001009023, 0.001055868, 0.000992934, 0.001148472, 0.001086814, 
             0.001110649, 0.001066759, 0.00126167, 0.001231778, 0.000944345]
    times = np.asarray(times)
    x = np.linspace(0.0008, 0.0014, 300)
    
    # Maximum likelihood estimate normal distribution
    mu, sigma = stats.norm.fit(times)  # simply the mean and uncorrected variance
    X = stats.Normal(mu=mu, sigma=sigma)
    
    # Kernel density estimate
    Y = stats.gaussian_kde(times)
    
    # Rosenblatt's Shifted Histogram
    z = stats.mstats.rsh(times, points=x)
    
    plt.plot(x, X.pdf(x), label='MLE Normal Distribution')
    plt.plot(x, Y.evaluate(x), label='KDE')
    plt.plot(x, z, label='RSH')
    plt.legend()
    plt.title("PDF Estimates")
    

    enter image description here

    I have no idea why the probability is much bigger than 1,

    The probability density function of a continuous distribution evaluated at x is not the probability that a random variable will assume value x( In a probability density function for a continuous random variable any single outcome has probability zero of occurring. Probability density is the probability per unit length, in other words, while the absolute likelihood for a continuous random variable to take on any particular value is 0 (since there is an infinite set of possible values to begin with SOURCE Wikipedia)), so it is not subject to the constraint you are thinking of (which is true of a probability mass function of a discrete distribution).

    If the PDF is f(x), the probability that a random variable assumes a value between x and x + h is approximately h * f(x) for sufficiently small h. The relevant constraint here is that a valid PDF must be non-negative and integrate to 1 over the support. Indeed:

    # provide limits of integration so the integrator can
    # easily find the nonzero part of the function.
    integrate.tanhsinh(X.pdf, 0, 0.15).integral
    # 0.9999999999999987