I tried to compute the probability density function (PDF) of my iris dataset for the petal lengths of the setosa flowers using numpy.histogram. I wanted to plot the PDF of the setosa petal lengths, but unfortunately I got confused about what np.histogram actually returns.
In the code below, going on my vague understanding, I set bins to 10 and density to True.
Could anyone explain what the code below does and what a PDF essentially is? Also, is there a better way to compute the PDF for this dataset?
import pandas as pd
import numpy as np
iris = pd.read_csv('iris.csv')
iris_setosa = iris[iris.species == 'setosa']
counts, bin_edges = np.histogram(iris_setosa["petal_length"], bins=10, density=True)
pdf = counts / sum(counts)
Let me put it this way:
When you run the line below and print out the counts and bin_edges variables,
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10, density=True)
The result will be
counts --> [0.22222222 0.22222222 0.44444444 1.55555556 2.66666667 3.11111111 1.55555556 0.88888889 0. 0.44444444]
bin_edges --> [1. 1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]
So here is what the above code does behind the scenes:
1. First, based on the number of bins and the minimum and maximum values in the setosa petal-length data, it calculates a bin width and builds a histogram whose X axis is petal length and whose Y axis is the number of flowers. You can see this if you just remove the density parameter from the call:
counts_number, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10)
This would result in:
counts_number --> [ 1 1 2 7 12 14 7 4 0 2]
So there is just 1 flower in the bin [1, 1.09).
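Step 1 can be checked with a runnable sketch. Note the array below is a hypothetical stand-in for the setosa petal lengths, not the real iris.csv values:

```python
import numpy as np

# Hypothetical stand-in for iris_setosa["petal_length"].
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.5, 1.6])

counts_number, bin_edges = np.histogram(petal_length, bins=10)

# Without density=True the counts are raw frequencies: they sum to the
# number of samples, and each of the 10 bins spans (max - min) / 10.
total = counts_number.sum()              # equals len(petal_length)
bin_width = bin_edges[1] - bin_edges[0]  # (1.7 - 1.3) / 10, i.e. about 0.04
print(total, bin_width)
```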
2. Next, it calculates the relative frequency for each bin, i.e. it divides counts_number by the total number of flowers (here 50, the number of setosa samples in the data set). You can verify this yourself:
rel_freq = counts_number / 50
print(rel_freq)
This would result in --> [0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0. 0.04]
These are relative frequencies and can also be interpreted as probability values; this interpretation rests on the law of large numbers.
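As a quick illustration of the law of large numbers (using simulated coin flips rather than the iris data, so this is just a sketch), the relative frequency of an outcome approaches its true probability as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate fair coin flips: the relative frequency of heads should
# approach the true probability 0.5 as the number of flips grows.
for n in (100, 10_000, 1_000_000):
    flips = rng.integers(0, 2, size=n)
    print(n, flips.mean())

final_rel_freq = flips.mean()  # relative frequency for the largest sample
```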
3. The Y values of a PDF are not actual probabilities but probability densities. So if we divide rel_freq by the bin width, we get
--> [0.22222222 0.22222222 0.44444444 1.55555556 2.66666667 3.11111111 1.55555556 0.88888889 0. 0.44444444]
As you can see, this is the same as what we got just by passing density=True.
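The whole chain can be checked numerically in one go. Here again the sample is a hypothetical stand-in for the setosa petal lengths; the identity holds for any 1-D data:

```python
import numpy as np

# Hypothetical sample standing in for iris_setosa["petal_length"].
data = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.5, 1.6])

raw_counts, edges = np.histogram(data, bins=10)
density, _ = np.histogram(data, bins=10, density=True)
bin_width = edges[1] - edges[0]

# density = relative frequency / bin width ...
rel_freq = raw_counts / len(data)
same = np.allclose(density, rel_freq / bin_width)

# ... which is why the densities integrate to 1 over the bins.
integrates_to_one = np.isclose((density * bin_width).sum(), 1.0)
print(same, integrates_to_one)
```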
Since you have not shown the complete code, i.e. what you are doing after calculating the variable pdf, let me make some assumptions and explain further.
The Y-axis values of a PDF can be greater than 1 because they are densities, not probabilities. The line in your program
pdf = counts / sum(counts)
normalizes the counts array. Put more plainly, this line does the same thing as multiplying the counts array by the bin width, i.e. it recovers the relative frequencies (a.k.a. probabilities) from the densities. So if you run the line below,
print(counts * 0.09)  # 0.09 is the bin width when bins=10
it will give ---> [0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0. 0.04]
This is exactly the same as the variable pdf.
Now you can use this pdf array to calculate the CDF, since the CDF is the cumulative sum of the probability in each bin. Using the density counts directly to calculate the CDF would not make sense.
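A minimal sketch of that CDF calculation (again with a hypothetical sample in place of the setosa petal lengths):

```python
import numpy as np

# Hypothetical stand-in for iris_setosa["petal_length"].
data = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.5, 1.6])

counts, bin_edges = np.histogram(data, bins=10, density=True)
pdf = counts / counts.sum()  # per-bin probabilities
cdf = np.cumsum(pdf)         # cumulative probability up to each bin's right edge

# The CDF is non-decreasing and ends at 1.
print(cdf[-1])
```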
Now we can plot the PDFs with the lines below. Note: make sure you import the relevant plotting libraries first; this is just example code.
plt.plot(bin_edges[1:], pdf, label="normalised_pdf")
plt.plot(bin_edges[1:], counts, label="actual_pdf")
In the resulting plot (image omitted) you can see that the two curves are just scaled versions of each other.
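For completeness, here is a self-contained version of the plotting sketch. The data array and the Agg backend are my assumptions, not part of the original question:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical stand-in for iris_setosa["petal_length"].
data = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.5, 1.6])

counts, bin_edges = np.histogram(data, bins=10, density=True)
pdf = counts / counts.sum()

# Plot the normalized pdf and the raw densities on the same axes;
# the two curves differ only by a constant factor (the bin width).
plt.plot(bin_edges[1:], pdf, label="normalised_pdf")
plt.plot(bin_edges[1:], counts, label="actual_pdf")
plt.xlabel("petal_length")
plt.legend()
plt.savefig("pdf_plot.png")

n_lines = len(plt.gca().lines)
```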