I have used the seaborn pairplot function and would like to extract a data array.
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
I want to get an array of the points shown in black below:
Thanks.
Just this line:
data = iris[iris['species'] == 'setosa']['sepal_length']
You are interested in the blue curve, i.e. the 'setosa' species. To filter the iris dataframe, I create this filter:
iris['species'] == 'setosa'
which is a boolean array whose values are True if the corresponding row in the 'species' column of the iris dataframe is 'setosa', and False otherwise. With this line of code:
iris[iris['species'] == 'setosa']
I apply the filter to the dataframe to extract only the rows associated with the 'setosa' species. Finally, I extract the 'sepal_length' column:
iris[iris['species'] == 'setosa']['sepal_length']
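The three steps above can be sketched on a tiny hand-made dataframe, so the result of each step is easy to inspect. The dataframe `df` and its values are made up for illustration and only mimic the layout of iris; the final `to_numpy()` call is an addition, assuming a plain NumPy array is wanted rather than a pandas Series.

```python
import pandas as pd

# Hand-made stand-in for the iris dataframe (illustrative values only).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.0],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "setosa"],
})

# Step 1: boolean filter, one True/False entry per row.
mask = df["species"] == "setosa"

# Step 2: keep only the rows where the filter is True.
setosa_rows = df[mask]

# Step 3: select the column of interest (still a pandas Series).
sepal = setosa_rows["sepal_length"]

# Optional: convert the Series to a plain NumPy array.
values = sepal.to_numpy()

# Equivalent single-step selection using .loc.
same_values = df.loc[mask, "sepal_length"].to_numpy()

print(mask.tolist())
print(values)
```

The `.loc[mask, column]` form does the row filter and the column selection in one indexing operation, which some people find easier to read than chaining two pairs of brackets.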
If I plot a KDE for this data array with this code:
data = iris[iris['species'] == 'setosa']['sepal_length']
sns.kdeplot(data)
I get:
which is the curve from the plot above that you are interested in.
The y-values differ from those in the plot above because of the way the KDE is calculated.
I quote this reference:
The y-axis in a density plot is the probability density function for the kernel density estimation. However, we need to be careful to specify this is a probability density and not a probability. The difference is the probability density is the probability per unit on the x-axis. To convert to an actual probability, we need to find the area under the curve for a specific interval on the x-axis. Somewhat confusingly, because this is a probability density and not a probability, the y-axis can take values greater than one. The only requirement of the density plot is that the total area under the curve integrates to one. I generally tend to think of the y-axis on a density plot as a value only for relative comparisons between different categories.
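To make the quoted point concrete, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy only. It is not how seaborn computes its KDE internally; the helper `gaussian_kde_1d`, the sample values, and the fixed bandwidth are all assumptions chosen for illustration. It shows a density whose peak is well above one while the total area under the curve is still approximately one.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    # One Gaussian bump per sample, evaluated at every grid point.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # The density is the average of the individual bumps.
    return kernels.mean(axis=1)

# Made-up, tightly clustered samples and a hand-picked bandwidth.
samples = np.array([5.0, 5.05, 5.1, 5.1, 5.15, 5.2])
bandwidth = 0.05

grid = np.linspace(4.5, 5.7, 2001)
density = gaussian_kde_1d(samples, grid, bandwidth)
dx = grid[1] - grid[0]

peak = density.max()          # a density value, may exceed 1
area = density.sum() * dx     # Riemann-sum approximation of the total area

# Probability of falling in [5.0, 5.2]: area under the curve on that interval.
inside = (grid >= 5.0) & (grid <= 5.2)
prob = density[inside].sum() * dx

print(f"peak density: {peak:.2f}")
print(f"total area:   {area:.4f}")
print(f"P(5.0 <= x <= 5.2) ~ {prob:.2f}")
```

Because the samples span only a narrow range of x, the curve has to be tall for its area to reach one, which is why the y-axis values depend on the spread of the data and on the bandwidth.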