python, python-3.x, seaborn, kernel-density

Get data array from a Seaborn pairplot


I have used the seaborn pairplot function and would like to extract a data array.

import seaborn as sns

iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")

I want to get an array of the points shown in black below:

enter image description here

Thanks.


Solution

  • Just this line:

    data = iris[iris['species'] == 'setosa']['sepal_length']
    

    You are interested in the blue line, which corresponds to the 'setosa' species. To filter the iris dataframe, I create this filter:

    iris['species'] == 'setosa'
    

    which is a boolean Series whose values are True where the corresponding row of the 'species' column of the iris dataframe is 'setosa', and False otherwise. With this line of code:

    iris[iris['species'] == 'setosa']
    

    I apply the filter to the dataframe to extract only the rows for the 'setosa' species. Finally, I select the 'sepal_length' column:

    iris[iris['species'] == 'setosa']['sepal_length']
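
As a sketch of the steps above, here is the same filtering on a small hand-built stand-in for the iris dataframe (an assumption, so that it runs without downloading anything; with the real data you would keep `iris = sns.load_dataset("iris")`). The names `mask`, `setosa_rows`, `sepal` and `values` are only illustrative. The last step, `.to_numpy()`, is one way to turn the selected column into a plain NumPy array.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the iris dataframe (assumption:
# the real data would come from sns.load_dataset("iris")).
iris = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "virginica", "setosa"],
    "sepal_length": [5.1, 4.9, 7.0, 6.3, 4.7],
})

mask = iris["species"] == "setosa"       # boolean Series, one entry per row
setosa_rows = iris[mask]                 # keep only the rows where mask is True
sepal = iris.loc[mask, "sepal_length"]   # filter rows and select the column in one step
values = sepal.to_numpy()                # plain NumPy array of the selected values

print(values)
```

Using `iris.loc[mask, "sepal_length"]` gives the same values as the chained `iris[mask]["sepal_length"]` while doing the row filter and column selection in a single indexing call.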
    

    If I plot a KDE for this data array with this code:

    data = iris[iris['species'] == 'setosa']['sepal_length']
    sns.kdeplot(data)
    

    I get:

    enter image description here

    which is the plot above that you are interested in.
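
If what you need is the points of the curve itself rather than the raw observations, one possible approach is to read them back from the matplotlib line that `sns.kdeplot` draws. This is a sketch under some assumptions: a recent seaborn version, an unfilled univariate KDE on a fresh Axes (so the curve is the first entry in `ax.lines`), and a short stand-in sample in place of the real column. The names `curve` and `area` are only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in sample (assumption): replace with
# iris[iris['species'] == 'setosa']['sepal_length'] for the real data.
data = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9],
                 name="sepal_length")

fig, ax = plt.subplots()          # fresh Axes, so it holds no other lines
sns.kdeplot(data, ax=ax)          # draws the KDE curve as a matplotlib Line2D
line = ax.lines[0]                # the only line on this Axes
x, y = line.get_data()            # x: evaluation grid, y: estimated density
curve = np.column_stack([x, y])   # array of (x, y) points, shape (n_points, 2)

# Trapezoidal area under the extracted curve, as a sanity check
area = float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))
print(curve.shape, area)
```

The diagonal panels of a pairplot may be drawn differently (for example as filled areas), so this sketch applies to the standalone `kdeplot` call shown above.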

    The y-axis values differ from those in the plot above because of the way the KDE is calculated.
    I quote this reference:

    The y-axis in a density plot is the probability density function for the kernel density estimation. However, we need to be careful to specify this is a probability density and not a probability. The difference is the probability density is the probability per unit on the x-axis. To convert to an actual probability, we need to find the area under the curve for a specific interval on the x-axis. Somewhat confusingly, because this is a probability density and not a probability, the y-axis can take values greater than one. The only requirement of the density plot is that the total area under the curve integrates to one. I generally tend to think of the y-axis on a density plot as a value only for relative comparisons between different categories.
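
The difference between a density and a probability described in the quote can be checked numerically. The following is a minimal sketch, not seaborn's implementation: it builds a Gaussian KDE by hand with NumPy on a short stand-in sample, using a hand-picked `bandwidth` of 0.1 (both are assumptions for illustration; seaborn chooses its bandwidth by its own rule). The helper names `gaussian_kde` and `area` are hypothetical.

```python
import numpy as np

# Stand-in sample (assumption): ten sepal-length-like values.
sample = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
bandwidth = 0.1  # hand-picked for illustration, not seaborn's default rule


def gaussian_kde(grid, points, h):
    """Average of Gaussian bumps of width h centred on each data point."""
    z = (grid[:, None] - points[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(points) * h * np.sqrt(2 * np.pi))


def area(x, y):
    """Trapezoidal rule for the area under y(x)."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))


grid = np.linspace(sample.min() - 1, sample.max() + 1, 2001)
density = gaussian_kde(grid, sample, bandwidth)

total = area(grid, density)                  # total area under the curve
inside = (grid >= 4.8) & (grid <= 5.2)
prob = area(grid[inside], density[inside])   # approximate probability of 4.8 <= x <= 5.2
peak = density.max()                         # highest density value on the grid

print(total, prob, peak)
```

With this narrow bandwidth the peak of the density is above 1, while the total area stays close to 1 and the area over the interval is a probability between 0 and 1.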