python python-3.x seaborn kernel-density

Get data array from a Seaborn pairplot

I have used the seaborn pairplot function and would like to extract a data array.

import seaborn as sns

iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")

I want to get an array of the points shown in black below:

Thanks.

Solution

Just this line:

data = iris[iris['species'] == 'setosa']['sepal_length']

You are interested in the blue line, which corresponds to the 'setosa' species. To filter the iris dataframe, I create this filter:

iris['species'] == 'setosa'

which is a boolean Series whose values are True where the corresponding row of the 'species' column in the iris dataframe is 'setosa', and False otherwise. With this line of code:

iris[iris['species'] == 'setosa']

I apply the filter to the dataframe to extract only the rows belonging to the 'setosa' species. Finally, I extract the 'sepal_length' column:

iris[iris['species'] == 'setosa']['sepal_length']
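
The steps above can also be written with an explicit mask and .loc, and the result converted with .to_numpy() if you want a plain NumPy array rather than a pandas Series. This is only a minimal sketch: the small dataframe `df` is a hypothetical stand-in with the same column names as iris (illustrative values, so it runs without downloading the dataset).

```python
import pandas as pd

# Stand-in for the iris dataframe: illustrative values only, same column names.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.3],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Step 1: boolean mask, True where the row belongs to 'setosa'.
mask = df["species"] == "setosa"

# Steps 2 and 3 in one call: keep the masked rows and select one column.
setosa_sepal_length = df.loc[mask, "sepal_length"]

# Convert the pandas Series to a plain NumPy array.
values = setosa_sepal_length.to_numpy()
print(values)
```

With the real data, replacing `df` by `iris` should give the same selection as the one-liner above.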

If I plot a KDE for this data array with this code:

data = iris[iris['species'] == 'setosa']['sepal_length']
sns.kdeplot(data)

I get:

which is the curve from the plot above that you are interested in.

The y-values differ from those in the plot above because of the way the KDE is calculated.
I quote this reference:

The y-axis in a density plot is the probability density function for the kernel density estimation. However, we need to be careful to specify this is a probability density and not a probability. The difference is the probability density is the probability per unit on the x-axis. To convert to an actual probability, we need to find the area under the curve for a specific interval on the x-axis. Somewhat confusingly, because this is a probability density and not a probability, the y-axis can take values greater than one. The only requirement of the density plot is that the total area under the curve integrates to one. I generally tend to think of the y-axis on a density plot as a value only for relative comparisons between different categories.
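
To make the quoted point concrete, here is a small sketch in plain Python (standard library only) of a Gaussian KDE on tightly clustered data: the curve rises well above one, yet the area under it stays close to one. The function name `gaussian_kde_density`, the sample values, and the fixed bandwidth are all assumptions for illustration. seaborn chooses its own bandwidth by default, so its curve would not match these numbers exactly.

```python
import math

def gaussian_kde_density(samples, x, bandwidth):
    """Average of Gaussian kernels centred on each sample, evaluated at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

# Hypothetical, tightly clustered measurements (not taken from iris).
samples = [5.0, 5.05, 5.1, 5.1, 5.15, 5.2]
bandwidth = 0.05  # assumed fixed bandwidth, chosen only for this example

# Evenly spaced grid extending five bandwidths past the data on each side.
lo = min(samples) - 5 * bandwidth
hi = max(samples) + 5 * bandwidth
n = 1000
step = (hi - lo) / n
grid = [lo + i * step for i in range(n + 1)]
density = [gaussian_kde_density(samples, x, bandwidth) for x in grid]

# Trapezoidal rule: approximate area under the curve over the grid.
area = step * (sum(density) - 0.5 * (density[0] + density[-1]))
peak = max(density)

print("peak density:", peak)
print("area under curve:", area)
```

The narrower the spread of the data (and the smaller the bandwidth), the taller the peak has to be for the area to remain one.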