As far as the documentation tells, seaborn's kdeplot works by using scipy.stats.gaussian_kde.
However, I get two different distributions when plotting with seaborn and with gaussian_kde, despite using the same bandwidth size.
In the picture above, the left plot is the distribution when the data is fed directly into gaussian_kde, whereas the right plot is the distribution when the data is fed into seaborn's kdeplot.
Also, the area under the curve (AUC) within a given boundary differs between the two ways of plotting the distribution:
auc using gaussian_kde : 47.7 and auc using via seaborn : 49.5
May I know what causes this difference, and is there a way to standardize the output regardless of the method used (e.g., seaborn or gaussian_kde)?
The code to reproduce the above plot and AUC is given below.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
time_window_order = ['272', '268', '264', '260', '256', '252', '248', '244', '240']
order_dict = {k: i for i, k in enumerate ( time_window_order )}
df = pd.DataFrame ( {'time_window': ['268', '268', '268', '264', '252', '252', '252', '240',
'256', '256', '256', '256', '252', '252', '252', '240'],
'seq_no': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b']} )
df ['centre_point'] = df ['time_window'].map ( order_dict )
filter_band = df ["seq_no"].isin ( ['a'] )
df = df [filter_band].reset_index ( drop=True )
auc_x_min, auc_x_max = 0, 4
bandwith=0.5
########################
plt.subplots(1, 2)
# make the first plot
plt.subplot(1, 2, 1)
kde0 = gaussian_kde ( df ['centre_point'], bw_method=bandwith )
xmin, xmax = -3, 12
x_1 = np.linspace ( xmin, xmax, 500 )
kde0_x = kde0 ( x_1 )
sel_region_x = x_1 [(x_1 > auc_x_min) * (x_1 < auc_x_max)]
sel_region_y = kde0_x [(x_1 > auc_x_min) * (x_1 < auc_x_max)]
auc_bond_1 = np.trapz ( sel_region_y, sel_region_x )
area_whole = np.trapz ( kde0_x, x_1 )
plt.plot ( x_1, kde0_x, color='b', label='KDE' )
plt.ylim(bottom=0)
plt.title(f'Direct gaussian_kde with bw {bandwith}')
plt.fill_between ( sel_region_x, sel_region_y, 0, facecolor='none', edgecolor='r', hatch='xx',
label='intersection' )
# make second plot
plt.subplot(1, 2, 2)
g = sns.kdeplot ( data=df, x="centre_point", bw_adjust=bandwith )
c = g.get_lines () [0].get_data ()
x_val = c [0]
kde0_x = c [1]
idx = (x_val> auc_x_min) * (x_val < auc_x_max)
sel_region_x = x_val [idx]
sel_region_y = kde0_x [idx]
auc_bond_2 = np.trapz ( sel_region_y, sel_region_x )
g.fill_between ( sel_region_x, sel_region_y, 0, facecolor='none', edgecolor='r', hatch='xx' )
plt.title(f'Via Seaborn with bw {bandwith}')
plt.tight_layout()
plt.show()
# show how much the area differs between the two plots
print ( f'auc using gaussian_kde : {auc_bond_1 * 100:.1f} and auc using via seaborn : {auc_bond_2 * 100:.1f}' )
x=1
Based on mwaskon's suggestion, changing these two lines to
kde0 = gaussian_kde ( df ['centre_point'], bw_method='scott' )
g = sns.kdeplot ( data=df, x="centre_point", bw_adjust=1 ) # Seaborn uses the scott method by default to determine the bw size
makes the two plots look visually identical.
However, the AUC of the two graphs still differs:
auc using gaussian_kde : 45.1 and auc using via seaborn : 44.6
You are calling scipy like this:
kde0 = gaussian_kde ( df ['centre_point'], bw_method=bandwith )
and seaborn like this:
g = sns.kdeplot ( data=df, x="centre_point", bw_adjust=bandwith )
But the kdeplot docs tell us that bw_adjust is a
Factor that multiplicatively scales the value chosen using bw_method. Increasing will make the curve smoother. See Notes.
whereas kdeplot also has a bw_method parameter that is a
Method for determining the smoothing bandwidth to use; passed to scipy.stats.gaussian_kde.
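To make the difference concrete, here is a minimal scipy-only sketch. It assumes that seaborn applies bw_adjust by multiplying the factor chosen by bw_method (Scott's rule by default), which is my reading of the docs quoted above rather than a statement about seaborn's internals. The data array is the centre_point column for seq_no 'a' from the question, and the variable names are only illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

# centre_point values for seq_no == 'a' in the question's data
data = np.array([1, 1, 1, 2, 5, 5, 5, 8], dtype=float)

# What the question's scipy call does: a scalar bw_method is used
# directly as the bandwidth factor.
kde_scalar = gaussian_kde(data, bw_method=0.5)

# What bw_adjust=0.5 is assumed to amount to: start from Scott's factor,
# then scale it by 0.5.
kde_scaled = gaussian_kde(data, bw_method="scott")
scott_factor = kde_scaled.factor
kde_scaled.set_bandwidth(bw_method=scott_factor * 0.5)

# The two factors differ, so the two curves cannot match.
print(f"scalar bw_method factor: {kde_scalar.factor:.3f}")
print(f"scott * bw_adjust factor: {kde_scaled.factor:.3f}")
```

Under that assumption, the seaborn call that corresponds to the scipy one would pass the number as bw_method (for example bw_method=bandwith) and leave bw_adjust at its default of 1.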
So if you want to equate the results from the two libraries, you need to make sure you're using the right parameters.
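As for the small AUC gap that remains once the bandwidths match, one likely cause (an assumption on my part, not something the docs quoted here state) is that the two curves are evaluated on different grids: the question uses 500 points on [-3, 12], while kdeplot picks its own grid (its gridsize and cut parameters). With a strict mask like x > 0 and x < 4, the trapezoid rule then drops slightly different slivers at each end. The sketch below sidesteps the grid by integrating the fitted density directly with gaussian_kde.integrate_box_1d; area_on_grid is a hypothetical helper that mirrors the question's approach.

```python
import numpy as np
from scipy.stats import gaussian_kde

# centre_point values for seq_no == 'a' in the question's data
data = np.array([1, 1, 1, 2, 5, 5, 5, 8], dtype=float)
kde = gaussian_kde(data, bw_method="scott")

lo, hi = 0, 4

# Integral of the fitted density between the bounds; no grid involved.
area_exact = kde.integrate_box_1d(lo, hi)


def area_on_grid(grid):
    """Trapezoid-rule area over grid points strictly inside (lo, hi)."""
    x = grid[(grid > lo) & (grid < hi)]
    y = kde(x)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)


# Same density, two different evaluation grids.
area_500 = area_on_grid(np.linspace(-3, 12, 500))
area_200 = area_on_grid(np.linspace(-3, 12, 200))

print(f"integrate_box_1d: {area_exact * 100:.1f}")
print(f"trapezoid, 500 points: {area_500 * 100:.1f}")
print(f"trapezoid, 200 points: {area_200 * 100:.1f}")
```

If you need a number that does not depend on how the curve was drawn, computing the area from the fitted gaussian_kde object this way is probably the simplest option.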