Search code examples
pythonnumpymatplotlibstatisticsprobability-density

Probability density function for a set of values using numpy


Below is the data for which I want to plot the PDF. https://gist.github.com/ecenm/cbbdcea724e199dc60fe4a38b7791eb8#file-64_general-out

Below is the script

import numpy as np
import matplotlib.pyplot as plt
import pylab

data = np.loadtxt('64_general.out')
H,X1 = np.histogram( data, bins = 10, normed = True, density = True) # Is this the right way to get the PDF ?
plt.xlabel('Latency')
plt.ylabel('PDF')
plt.title('PDF of latency values')

plt.plot(X1[1:], H)
plt.show()

When I plot the above, I get the following.

  1. Is the above the correct way to calculate the PDF of a range of values
  2. Is there any other way to confirm that the results I get is the actual PDF. For example, how can show the area under pdf = 1 for my case.

enter image description here


Solution

    1. It is a legit way of approximating the PDF. Since np.histogram uses various techniques for binning the values you won't get the exact frequency of each number in your input. For a more exact approximation you should count the occurrence of each number and divide it by the total count. Also, since these are discrete values, the plot could be plotted as points or bars to give a more correct impression.

    2. In the discrete case, the sum of the frequencies should equal 1. In the continuous case you can for example use np.trapz() to approximate the integral.