I'm trying to extract the points of a KDE plot so I can send them through an API and have the frontend draw the plot. For example, given the following data:
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'x': [3000.0, 2897.0, 4100.0, 2539.28, 5000.0,
                         3615.0, 2562.05, 2535.0, 2413.0, 2246.0],
                   'y': [1, 2, 1, 1, 1, 2, 1, 3, 1, 1]})
sns.kdeplot(x=df['x'], weights=df['y'])
Plotting it with seaborn's kdeplot, as above, gives me this plot:
Now I wanted to send some points of this plot via an API. My idea was to use KernelDensity from sklearn to estimate the density of some points. So I used this code:
from sklearn.neighbors import KernelDensity
x_points = np.linspace(0, df['x'].max(), 30)
kde = KernelDensity()
kde.fit(df['x'].values.reshape(-1, 1), sample_weight=df['y'])
logprob = kde.score_samples(x_points.reshape(-1, 1))
new_df = pd.DataFrame({'x': x_points, 'y': np.exp(logprob)})
However, when I draw new_df as a line plot, it looks nothing like the seaborn kdeplot.
My question is: Given a dataframe and the kdeplot shown, how can I get the probability of some point x in this plot?
EDIT: Adding code to plot sns.kdeplot
Why does the plot with sklearn look different? Because KernelDensity uses bandwidth=1.0 by default, which is far too small for x-values in the thousands. You can fix this by changing one line:
kde = KernelDensity(bandwidth=500)
Seaborn, by contrast, chooses the bandwidth automatically (Scott's rule by default). SciPy's gaussian_kde lets you do the same, as explained here.
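To make the automatic bandwidth concrete, here is a minimal sketch that evaluates a weighted KDE at arbitrary points with scipy.stats.gaussian_kde and then reuses the bandwidth it picked in sklearn. It assumes SciPy 1.2 or later (for the weights argument) and that Scott's rule is what you want; the variable names are only illustrative. The result should be close to what seaborn draws, though I have not checked that the two agree to the last digit.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

df = pd.DataFrame({'x': [3000.0, 2897.0, 4100.0, 2539.28, 5000.0,
                         3615.0, 2562.05, 2535.0, 2413.0, 2246.0],
                   'y': [1, 2, 1, 1, 1, 2, 1, 3, 1, 1]})

x = df['x'].to_numpy()
w = df['y'].to_numpy()

# Weighted Gaussian KDE; 'scott' is SciPy's default bandwidth rule.
kde = gaussian_kde(x, bw_method='scott', weights=w)

# Density at any points you like (here the same grid as in the question).
x_points = np.linspace(0, x.max(), 30)
density = kde(x_points)

# kde.covariance is the data covariance already scaled by the Scott factor,
# so its square root is the kernel bandwidth in the units of x.
bandwidth = float(np.sqrt(kde.covariance[0, 0]))

# Passing that bandwidth to sklearn gives the same curve as SciPy.
sk = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
sk.fit(x.reshape(-1, 1), sample_weight=w)
sk_density = np.exp(sk.score_samples(x_points.reshape(-1, 1)))

new_df = pd.DataFrame({'x': x_points, 'y': density})
```

With this approach you never need the plot at all: kde(value) returns the density at any x you pass in.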
Seaborn is a layer on top of matplotlib, and kdeplot returns a matplotlib Axes, so the answer to this question about getting data from a matplotlib plot applies here too.
import matplotlib.pyplot as plt
# Call right after sns.kdeplot(...): the first line on the current axes is the KDE curve.
plt.gca().get_lines()[0].get_xydata()
The output is an array of (x, density) pairs, which is what you are after:
array([[5.70706380e+02, 7.39051159e-07],
[6.01382697e+02, 9.00695337e-07],
[6.32059015e+02, 1.09427429e-06],
[6.62735333e+02, 1.32531892e-06],
[6.93411651e+02, 1.60015322e-06],
[7.24087969e+02, 1.92597554e-06],
[7.54764286e+02, 2.31094202e-06],
[7.85440604e+02, 2.76425104e-06],
[8.16116922e+02, 3.29622720e-06],
...])