Tags: pandas, dataframe, cut, resampling

In Pandas, how can a DataFrame be binned by two columns, with the other columns changed to the means within those bins?


I've got the standard iris dataset projected down to two dimensions using UMAP, with the UMAP dimensions for the x and y positions of the 2D plot added as columns to the dataframe:

import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
import umap # pip install umap-learn

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Series(iris.target).map(dict(zip(range(3), iris.target_names)))

_umap = umap.UMAP().fit_transform(iris.data)
iris_df['UMAP_x'] = _umap[:,0]
iris_df['UMAP_y'] = _umap[:,1]
iris_df.head()

I'd like to bin both the UMAP_x and UMAP_y columns into roughly 25 bins, and then replace the other columns in the dataframe with the mean values of those columns within each bin. How might this be done? It feels like cut or resampling might lead to the answer, but I'm not sure how.


Solution

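  • You can use cut to define the bins and then groupby with transform to compute the mean within each bin. Note that the code here bins each axis independently and averages only the UMAP coordinates themselves.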

    import numpy as np
    import math
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from sklearn.datasets import load_iris
    import umap
    
    iris = load_iris()
    iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
    iris_df['species'] = pd.Series(iris.target).map(dict(zip(range(3), iris.target_names)))
    
    _umap = umap.UMAP().fit_transform(iris.data)
    iris_df['UMAP_x'] = _umap[:,0]
    iris_df['UMAP_y'] = _umap[:,1]
    
    # Define bins for UMAP_x and UMAP_y params
    iris_df['UMAP_x_bin'] = pd.cut(iris_df['UMAP_x'], bins=25)
    iris_df['UMAP_y_bin'] = pd.cut(iris_df['UMAP_y'], bins=25)
    
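    # Mean of each UMAP coordinate within its own bin, broadcast back to every row
    # (observed=True only considers bins that actually contain rows)
    iris_df['UMAP_x_mean'] = iris_df.groupby('UMAP_x_bin', observed=True)['UMAP_x'].transform('mean')
    iris_df['UMAP_y_mean'] = iris_df.groupby('UMAP_y_bin', observed=True)['UMAP_y'].transform('mean')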
    
    iris_df.head()
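
If the goal is one row per 2D bin with the remaining numeric columns averaged, one possible approach is to group on both bin columns at once and call mean. The sketch below is only an illustration under some assumptions: bin_2d_means is a hypothetical helper name, the frame is a synthetic stand-in (random numbers instead of the real iris/UMAP values, so it runs without umap-learn), and bins=5 per axis is my reading of "roughly 25 bins" as a 5 x 5 grid. Pass bins=25 if you want 25 bins along each axis instead.

```python
import numpy as np
import pandas as pd


def bin_2d_means(df, x_col, y_col, bins=5):
    """Hypothetical helper: average numeric columns within each (x, y) grid cell."""
    binned = df.assign(
        x_bin=pd.cut(df[x_col], bins=bins),
        y_bin=pd.cut(df[y_col], bins=bins),
    )
    # observed=True keeps only grid cells that contain at least one row;
    # numeric_only=True skips non-numeric columns such as 'species'.
    return (
        binned.groupby(['x_bin', 'y_bin'], observed=True)
        .mean(numeric_only=True)
        .reset_index()
    )


# Synthetic stand-in for the UMAP-embedded iris frame (made-up values, assumption)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'sepal length (cm)': rng.normal(5.8, 0.8, 150),
    'petal length (cm)': rng.normal(3.7, 1.7, 150),
    'species': rng.choice(['setosa', 'versicolor', 'virginica'], 150),
    'UMAP_x': rng.uniform(-5, 15, 150),
    'UMAP_y': rng.uniform(-5, 10, 150),
})

binned_means = bin_2d_means(demo_df, 'UMAP_x', 'UMAP_y', bins=5)
print(binned_means.head())
```

With the real data you would call bin_2d_means(iris_df, 'UMAP_x', 'UMAP_y'). The result has one row per occupied cell, so the UMAP_x and UMAP_y columns in it are the mean positions of the points in that cell. If you would rather keep all 150 rows and overwrite each value with its cell mean, swapping .mean(numeric_only=True) for a .transform('mean') on the selected numeric columns should do that.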