Tags: python, scipy, seaborn, auc

Different KDE rendering when using scipy gaussian_kde and seaborn kdeplot


As far as the documentation tells, seaborn's kdeplot works by using scipy.stats.gaussian_kde.

However, I get two different distributions when plotting with seaborn and with gaussian_kde, despite using the same bandwidth value.

[Image: two KDE plots side by side, titled "Direct gaussian_kde with bw 0.5" (left) and "Via Seaborn with bw 0.5" (right)]

In the picture above, the left plot is the distribution when the data is fed directly into gaussian_kde, whereas the right plot is the distribution when the same data is fed into seaborn's kdeplot.

The area under the curve within a given interval also differs between these two ways of plotting the distribution.

auc using gaussian_kde : 47.7 and auc using via seaborn : 49.5

What causes this difference, and is there a way to get the same output regardless of the method used (seaborn or gaussian_kde)?

The code to reproduce the above plot and auc is given below.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde


time_window_order = ['272', '268', '264', '260', '256', '252', '248', '244', '240']
order_dict = {k: i for i, k in enumerate ( time_window_order )}
df = pd.DataFrame ( {'time_window': ['268', '268', '268', '264', '252', '252', '252', '240',
                                     '256', '256', '256', '256', '252', '252', '252', '240'],
                     'seq_no': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
                                'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b']} )
df ['centre_point'] = df ['time_window'].map ( order_dict )
filter_band = df ["seq_no"].isin ( ['a'] )
df = df [filter_band].reset_index ( drop=True )
auc_x_min, auc_x_max = 0, 4
bandwith=0.5
########################

plt.subplots(1, 2)
# make the first plot
plt.subplot(1, 2, 1)
kde0 = gaussian_kde ( df ['centre_point'], bw_method=bandwith )
xmin, xmax = -3, 12
x_1 = np.linspace ( xmin, xmax, 500 )
kde0_x = kde0 ( x_1 )
sel_region_x = x_1 [(x_1 > auc_x_min) * (x_1 < auc_x_max)]
sel_region_y = kde0_x [(x_1 > auc_x_min) * (x_1 < auc_x_max)]
auc_bond_1 = np.trapz ( sel_region_y, sel_region_x )
area_whole = np.trapz ( kde0_x, x_1 )
plt.plot ( x_1, kde0_x, color='b', label='KDE' )
plt.ylim(bottom=0)
plt.title(f'Direct gaussian_kde with bw {bandwith}')
plt.fill_between ( sel_region_x, sel_region_y, 0, facecolor='none', edgecolor='r', hatch='xx',
                   label='intersection' )

# make second plot
plt.subplot(1, 2, 2)

g = sns.kdeplot ( data=df, x="centre_point", bw_adjust=bandwith )
c = g.get_lines () [0].get_data ()
x_val = c [0]
kde0_x = c [1]
idx = (x_val> auc_x_min) * (x_val < auc_x_max)
sel_region_x = x_val [idx]
sel_region_y = kde0_x [idx]
auc_bond_2 = np.trapz ( sel_region_y, sel_region_x )
g.fill_between ( sel_region_x, sel_region_y, 0, facecolor='none', edgecolor='r', hatch='xx' )
plt.title(f'Via Seaborn with bw {bandwith}')
plt.tight_layout()
plt.show()

# show how much the area differs between the two plots
print ( f'auc using gaussian_kde : {auc_bond_1 * 100:.1f} and auc using via seaborn : {auc_bond_2 * 100:.1f}' )

Edit

Following mwaskom's suggestion, changing these two lines to

kde0 = gaussian_kde ( df ['centre_point'], bw_method='scott' )

g = sns.kdeplot ( data=df, x="centre_point", bw_adjust=1 )  # seaborn uses Scott's method by default to determine the bandwidth

gives

[Image: the two KDE plots after the change]

Visually, the two plots look identical.

However, the AUC values computed from the two graphs still differ:

auc using gaussian_kde : 45.1 and auc using via seaborn : 44.6


Solution

  • You are calling scipy like this:

    kde0 = gaussian_kde ( df ['centre_point'], bw_method=bandwith )
    

    and seaborn like this:

    g = sns.kdeplot ( data=df, x="centre_point", bw_adjust=bandwith )
    

    But the kdeplot docs tell us that bw_adjust is a

    Factor that multiplicatively scales the value chosen using bw_method. Increasing will make the curve smoother. See Notes.

    whereas kdeplot also has a bw_method parameter that is a

    Method for determining the smoothing bandwidth to use; passed to scipy.stats.gaussian_kde.

    So if you want to equate the results from the two libraries, you need to make sure you're using the right parameters.
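To make the distinction concrete, here is a minimal scipy-only sketch of the two settings being compared. It assumes the `centre_point` values for `seq_no == 'a'` from the question (1, 1, 1, 2, 5, 5, 5, 8), and it mimics `bw_adjust` by scaling Scott's factor, following the docs' description of that parameter rather than calling seaborn itself. The names `kde_scalar` and `kde_adjusted` are only for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# centre_point values for seq_no == 'a', taken from the question's data
data = np.array([1, 1, 1, 2, 5, 5, 5, 8], dtype=float)
bandwidth = 0.5

# What the question passed to scipy: a scalar bw_method is used
# directly as the KDE factor.
kde_scalar = gaussian_kde(data, bw_method=bandwidth)

# What bw_adjust=0.5 asks for according to the kdeplot docs:
# the factor chosen by bw_method (Scott's rule by default), scaled by 0.5.
kde_adjusted = gaussian_kde(data, bw_method="scott")
kde_adjusted.set_bandwidth(kde_adjusted.factor * bandwidth)

# The two factors are not the same, so the two curves differ.
print(f"scalar bw_method factor: {kde_scalar.factor:.4f}")
print(f"scaled Scott factor:     {kde_adjusted.factor:.4f}")

# Area under each curve on [0, 4], computed analytically by scipy
# instead of by trapezoids on a plotting grid.
auc_scalar = kde_scalar.integrate_box_1d(0, 4)
auc_adjusted = kde_adjusted.integrate_box_1d(0, 4)
print(f"auc (scalar bw_method): {auc_scalar * 100:.1f}")
print(f"auc (scaled Scott):     {auc_adjusted * 100:.1f}")
```

On the seaborn side, passing the scalar as `bw_method=bandwith` (and leaving `bw_adjust` at 1) should correspond to the first case. Any small difference that remains between trapezoid areas is likely because the two curves are sampled on different x grids; `integrate_box_1d` avoids that dependence on the grid.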