python pandas sorting group-by multi-index

How to sort string values as numeric ones in a multilevel index?

So I have been trying to sort a dataset by their assay numbers and their sample IDs, taking the mean of all the samples replicates for each assay number (Ex. A sample ran in one assay number could've been ran 1-10 times).

Doing so, I used the groupby method in pandas and was successful with the exception that sample names that were above 9 (ex. 10, 11, 20, 50) would be sorted near lower sample numbers (Ex.10 under 1, 20 under 2, 50 under 5) instead of where they should be in the column.

I understand it's a matter of the sorted_values method checking from left to right, but now I cannot even use the natsorted method to rectify it without completely undoing the groupby.

I've tried sorting the data beforehand to how I want it before doing the groupby, but that didn't work as it sorts it again after. I also tried sort = False but then the assay numbers are in the wrong place and I again cannot sort them properly.

import pandas as pd
import numpy as np
import natsort as ns

#Takes a sheet from an excel file and turns it into a data frame. 

excel_name = "~~~~"
from_sheet_name = "OARS 2-8 "
output_name = "~~~ Data2.xlsx"

df = pd.read_excel(excel_name, from_sheet_name)
df.head(5)

#Prepares the averaged ALUs
df2= df.groupby(["Assay Num", "Sample Name"],sort = True)
average_alus = df2["Resp 1"].agg(np.mean)
# # average_alus["Sample Name"] = ns.natsorted(average_alus["Sample 
Name"])
# # ns.natsorted(average_alus["Sample Name"])
# average_alus_df = pd.DataFrame(average_alus)
# type(average_alus_df)
# average_alus_df = average_alus_df.reset_index()
# average_alus_df["Sample Name"] = 
ns.natsorted(average_alus_df["Sample Name"])
# average_alus_df


# Prepares the data for the second sheet. Takes assay number, sample 
name, and ALUs, sorts them, and drops empty rows

df3 = df[["Assay Num","Sample Name", "Resp 1"]]
df3 = df3.sort_values(["Assay Num", "Sample Name"], ascending = 
[True, True])
# df3["Sample Name"] = ns.natsorted(df3["Sample Name"])
df3 = df3.dropna()


# Writes both sheets to an excel file
writer2 = pd.ExcelWriter(output_name)
average_alus.to_excel(writer2, sheet_name = "Averaged ALUs", index = 
"False")
df3.to_excel(writer2, sheet_name = "Raw Reps Sorted", index = 
"False")

writer2.save()

This was my first attempt, with my second attempts commented out. this is what I want the table to look like

Assay num	Sample Name	ALUs
175	CRM 1	1.507
	CRM 2	2.8
	~~~	~
	CRM 10	300

Vs how it actually looks like

Assay num	Sample Name	ALUs
175	CRM 1	1.507
	CRM 10	300
	CRM 2	2.8
	CRM 3	9.7

And once I try sorting it after turning the series back into a dataframe, the groupby becomes undone

Assay num	Sample Name	ALUs
175	CRM 1	1.407
175	CRM 1	1.5
175	CRM 1	1.3
175	CRM 2	9.7

Solution

Here is a toy dataframe mimicking yours:

import random

import pandas as pd

df = pd.DataFrame(
    {
        "Assay num": [176] * 10 + [175] * 10,
        "Sample Name": random.sample([f"CRM {i+1}" for i in range(10)], k=10) * 2,
        "ALUs": [random.randint(1, 2_000) for _ in range(10)] * 2,
    }
)

print(df)
# Output

    Assay num Sample Name  ALUs
0         176       CRM 3  1392
1         176       CRM 2   869
2         176       CRM 4  1732
3         176       CRM 1  1385
4         176       CRM 5    49
5         176       CRM 8  1898
6         176       CRM 6  1965
7         176      CRM 10   626
8         176       CRM 7   845
9         176       CRM 9  1191
10        175       CRM 3  1392
11        175       CRM 2   869
12        175       CRM 4  1732
13        175       CRM 1  1385
14        175       CRM 5    49
15        175       CRM 8  1898
16        175       CRM 6  1965
17        175      CRM 10   626
18        175       CRM 7   845
19        175       CRM 9  1191

Here is one way to sort string values as numeric ones in what ends up to be a multilevel index:

new_df = pd.concat(
    [
        sub_df.sort_values("Sample Name", key=lambda col: col.str[4:].astype(int))
        .groupby(["Assay num", "Sample Name"], sort=False)
        .agg(sum)  # will not sum anything
        for _, sub_df in df.groupby(["Assay num"])
    ]
)

print(new_df)
# Output

Assay num Sample Name      
175       CRM 1        1385
          CRM 2         869
          CRM 3        1392
          CRM 4        1732
          CRM 5          49
          CRM 6        1965
          CRM 7         845
          CRM 8        1898
          CRM 9        1191
          CRM 10        626
176       CRM 1        1385
          CRM 2         869
          CRM 3        1392
          CRM 4        1732
          CRM 5          49
          CRM 6        1965
          CRM 7         845
          CRM 8        1898
          CRM 9        1191
          CRM 10        626