Search code examples
pythonmatplotlibseabornkernel-density

Finding the total probability under a shape in a bivariate KDE plot


I have a set of points stored as (x,y) values. My goal is the map these onto a coordinate plane and create a continuous PDF distribution.

I would like to apply polygons to the figure and get the total summed probability under the shape and return that to the user. The shape would be stored as a series of coordinates, so [(0,0), (1,0), (1,1),(0,1),(0,0)] would represent a square.

So far, I have plotted the points using a seaborn.kdeplot, and that generates a beautiful plot, which the sum of every point adds to around 100%.

However, I am struggling to effectively apply shapes directly the the graph. I've tried a few online solutions, but have been unable to find any method to get the cumulative probability under a shape directly on the kde plot. My code is below:

def get_field_matrix(csv_name, coordinates):
    
    # ... Some code that loads in a csv and names it df_heatmap
    # df_heatmap has two columns, ActualX and ActualY which are the series of points for the kde
    
    # Create a KDE-based heatmap using seaborn
    kde = sns.kdeplot(x='ActualX', y='ActualY', data=df_heatmap, cmap='viridis', fill=True, cbar=True, weights=np.linalg.norm(df_heatmap, axis=1) ** .3)
    
    # Set labels and title
    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.title('KDE-based Heatmap of X, Y Coordinates')

    # Coordinates is the list of tuples that make up the shape to be applied. 
    # Any shape may be represented, but it will always be a shape draw with a single line
    # Creates a matplotlib path out of the coordinates
    shape_path = Path(coordinates)
    shape_patch = PathPatch(shape_path, lw=0)
    plt.gca().add_patch(shape_patch)

    print("Summed Probability over the shape:", "A")

    # Set aspect ratio to be equal
    plt.gca().set_aspect('equal', adjustable='box')

    # Show the plot
    plt.show()
    return kde

Is there some library function I am missing to apply the cdf?


Solution

  • The example below shows how you can place a patch over a KDE plot, and integrate the area underneath.

    I don't think the KDE data from seaborn (left) is directly accessible, so I've run a separate KDE (right) that tries to match seaborn.

    The green points represent the user-defined coordinates, and the hatched area is just to confirm that the patch has been correctly interpolated onto the same grid as the KDE data.

    enter image description here

    A KDE estimator is first fitted to the data, and then used to get a KDE estimate over a fine grid. The user-defined patch coordinates are used to build a triangulated surface, which is then interpolated onto the fine rectangular grid from before. The patch and KDE are now in the same space. They are finally multiplied and integrated over.


    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    import pandas as pd
    
    #
    # Create test dataset
    #
    from sklearn.datasets import make_moons
    np.random.seed(0)
    X, _ = make_moons(n_samples=500, noise=0.1)
    X_df = pd.DataFrame({'x0': X[:, 0], 'x1': X[:, 1]})
    
    kde_thresh = 0.05 #default sns clipping
    
    if True: #View datapoints
        sns.kdeplot(
            X_df, x='x0', y='x1', cmap='inferno', alpha=0.4,
            thresh=kde_thresh, cbar=True
        )
        sns.scatterplot(X_df, x='x0', y='x1', color='black', marker='.')
        sns_limits = plt.axis()
        plt.gcf().set_size_inches(6, 3)
        plt.title('seaborn kde plot')
        plt.show()
    
    #
    # Model using kernel density estimator
    #
    from scipy.stats import gaussian_kde
    kd_estimator = gaussian_kde(X_df.values.T, bw_method='scott')
    
    #Create grid for plotting estimator results
    x_axis = np.linspace(sns_limits[0], sns_limits[1], num=200) #200 is sns kde default
    y_axis = np.linspace(sns_limits[2], sns_limits[3], num=200)
    x_grid, y_grid = np.meshgrid(x_axis, y_axis)
    
    xy_paired = np.column_stack([x_grid.ravel(), y_grid.ravel()])
    kde_scores = kd_estimator.evaluate(xy_paired.T).reshape(x_grid.shape)
    
    from matplotlib import ticker
    f, ax = plt.subplots(figsize=(6.2, 3))
    c = ax.contour(
        x_grid, y_grid, np.ma.masked_less(kde_scores, kde_thresh),
        cmap='inferno', alpha=0.5
    )
    ax.scatter('x0', 'x1', data=X_df, marker='.', color='black', s=7)
    ax.figure.colorbar(c, ticks=ticker.MaxNLocator(10))
    ax.set(xlabel='x0', ylabel='x1')
    ax.set_title('manual kde plot')
    
    #
    # Place a patch over the KDE
    #
    
    #User-defined patch
    # patch = np.array( [[-2, -2], [2, -2], [2, 2], [-2, 2]] )     #most of the area
    # patch = np.array( [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)] ) #square
    # patch = np.array( [(0,1), (1,1), (1,0), (0,1)] )             #triangle
    patch = np.array( [(0,1), (1,1), (1,0), (.8, .8), (0,1)] )   #concave
    
    patch_x = patch[:, 0]
    patch_y = patch[:, 1]
    
    #View patch
    ax.scatter(patch_x, patch_y, marker='o', color='darkgreen', alpha=0.7)
    ax.fill(patch_x, patch_y, color='darkgreen', linestyle='none', alpha=0.5)
    
    #Map patch onto 2D xy grid
    import matplotlib.path as mpath
    patch_path = mpath.Path(np.column_stack([patch_x, patch_y]))
    patch_interp = patch_path.contains_points(
        np.column_stack([x_grid.ravel(), y_grid.ravel()])
    ).reshape(x_grid.shape)
    patch_interp = np.where(patch_interp, 1, 0)
    
    #Visually confirm that the polygon has been correctly mapped onto the rectilinear grid
    ax.contourf(x_grid, y_grid, patch_interp, levels=[0.9, 1], hatches=['//'], colors='none')
    
    #
    #Finally, extract the KDE portion and integrate its area
    #
    kde_patch = kde_scores * patch_interp
    grid_area = np.diff(x_axis)[0] * np.diff(y_axis)[0]
    area = (kde_patch * grid_area).sum()
    
    ax.set_title(r'$\int KDE$ over patch is ' + f'{area:.3f}')