I have a huge data frame that looks something like what i've attached below. I would like to group-by the column 'sscinames' as it has similar entries like 'Acidobacteria bacterium' (get the sum of it for each sample) and then plot a heatmap. Would like the heat map to only show the top 20 'sscinames' based on the sample count. Any help would be appreciated.
sscinames S3_Day90_P3 S3_Day60_P3 S3_Day0_P1 S3_Day60_P1 S3_Day90_P1
Thermoplasmata archaeon 4 0 41 1 5
Planctomycetes bacterium 5 3 0 1 40
Acidobacteria bacterium 6 15 0 8 13
Trebonia kvetii 0 0 16 1 7
Nonomuraea sp. RK-328 24 4 4 1 2
Nitrospirae bacterium 3 1 4 1 2
Acidobacteria bacterium 11 11 0 9 27
This requires a few steps, but it is a perfect task for pandas and seaborn. I commented the example below to give you an idea of what is happening there.
import pandas as pd
import seaborn as sns
# This is just to create a dataframe from your table, replace with importing yours
df1 = pd.DataFrame([["Thermoplasmata archaeon", 4, 0, 41, 1, 5],
["Planctomycetes bacterium", 5, 3, 0, 1, 40],
["Acidobacteria bacterium", 6, 15, 0, 8, 13],
["Trebonia kvetii", 0, 0, 16, 1, 7],
["Nonomuraea sp. RK-328", 24, 4, 4, 1, 2],
["Nitrospirae bacterium", 3, 1, 4, 1, 2],
["Acidobacteria bacterium", 11, 11, 0, 9, 27]])
df1.columns = ["sscinames", "S3_Day90_P3", "S3_Day60_P3", "S3_Day0_P1", "S3_Day60_P1", "S3_Day90_P1"]
# Create column with total sample count
df2 = pd.DataFrame(df1.iloc[:,1:].T.sum()) # This sums (along rows, hence the .T) all samples for each row
df2.columns = ["Total Samples"]
# ...and merge with your data frame to add the new column (axis=1)
df = pd.concat([df1, df2], axis=1)
# Now, turn the first column into a pandas index (then seaborn uses it immediately to label the axes)
df = df.set_index(['sscinames'])
# You can now sort your dataframe by the new column and proceed with the top entries
df = df.sort_values(by="Total Samples", ascending=False)
df_slice2plot = df.iloc[:5, :] # This takes the top five rows and all the columns, change as needed
# Now, on to plotting
sns.heatmap(df_slice2plot.iloc[:, 1:]) # Plot numeric columns (first column, with index 0, are the labels)
This is the result I get: