Search code examples
pythonpandasseabornheatmap

heatmap for large dataframe after grouping


I have a huge data frame that looks something like what i've attached below. I would like to group-by the column 'sscinames' as it has similar entries like 'Acidobacteria bacterium' (get the sum of it for each sample) and then plot a heatmap. Would like the heat map to only show the top 20 'sscinames' based on the sample count. Any help would be appreciated.

sscinames   S3_Day90_P3 S3_Day60_P3 S3_Day0_P1  S3_Day60_P1 S3_Day90_P1
    Thermoplasmata archaeon 4   0   41  1   5
    Planctomycetes bacterium    5   3   0   1   40
    Acidobacteria bacterium 6   15  0   8   13
    Trebonia kvetii 0   0   16  1   7
    Nonomuraea sp. RK-328   24  4   4   1   2
    Nitrospirae bacterium   3   1   4   1   2
    Acidobacteria bacterium 11  11  0   9   27

enter image description here


Solution

  • This requires a few steps, but it is a perfect task for pandas and seaborn. I commented the example below to give you an idea of what is happening there.

    import pandas as pd
    import seaborn as sns
    
    # This is just to create a dataframe from your table, replace with importing yours
    df1 = pd.DataFrame([["Thermoplasmata archaeon",  4,  0, 41, 1,  5],
                        ["Planctomycetes bacterium", 5,  3,  0, 1, 40],
                        ["Acidobacteria bacterium",  6, 15,  0, 8, 13],
                        ["Trebonia kvetii",          0,  0, 16, 1,  7],
                        ["Nonomuraea sp. RK-328",   24,  4,  4, 1,  2],
                        ["Nitrospirae bacterium",    3,  1,  4, 1,  2],
                        ["Acidobacteria bacterium", 11, 11,  0, 9, 27]])
    df1.columns = ["sscinames", "S3_Day90_P3", "S3_Day60_P3",  "S3_Day0_P1", "S3_Day60_P1", "S3_Day90_P1"]
    
    # Create column with total sample count
    df2 = pd.DataFrame(df1.iloc[:,1:].T.sum()) # This sums (along rows, hence the .T) all samples for each row
    df2.columns = ["Total Samples"]
    # ...and merge with your data frame to add the new column (axis=1)
    df = pd.concat([df1, df2], axis=1)
    
    # Now, turn the first column into a pandas index (then seaborn uses it immediately to label the axes)
    df = df.set_index(['sscinames'])
    
    # You can now sort your dataframe by the new column and proceed with the top entries
    df = df.sort_values(by="Total Samples", ascending=False)
    df_slice2plot = df.iloc[:5, :] # This takes the top five rows and all the columns, change as needed
    
    # Now, on to plotting
    sns.heatmap(df_slice2plot.iloc[:, 1:]) # Plot numeric columns (first column, with index 0, are the labels)
    

    This is the result I get:

    A heatmap of the example data