Search code examples
pythonpandassortinggroup-bymulti-index

How to sort string values as numeric ones in a multilevel index?


So I have been trying to sort a dataset by their assay numbers and their sample IDs, taking the mean of all the samples replicates for each assay number (Ex. A sample ran in one assay number could've been ran 1-10 times).

Doing so, I used the groupby method in pandas and was successful with the exception that sample names that were above 9 (ex. 10, 11, 20, 50) would be sorted near lower sample numbers (Ex.10 under 1, 20 under 2, 50 under 5) instead of where they should be in the column.

I understand it's a matter of the sorted_values method checking from left to right, but now I cannot even use the natsorted method to rectify it without completely undoing the groupby.

I've tried sorting the data beforehand to how I want it before doing the groupby, but that didn't work as it sorts it again after. I also tried sort = False but then the assay numbers are in the wrong place and I again cannot sort them properly.

import pandas as pd
import numpy as np
import natsort as ns

#Takes a sheet from an excel file and turns it into a data frame. 

excel_name = "~~~~"
from_sheet_name = "OARS 2-8 "
output_name = "~~~ Data2.xlsx"

df = pd.read_excel(excel_name, from_sheet_name)
df.head(5)

#Prepares the averaged ALUs
df2= df.groupby(["Assay Num", "Sample Name"],sort = True)
average_alus = df2["Resp 1"].agg(np.mean)
# # average_alus["Sample Name"] = ns.natsorted(average_alus["Sample 
Name"])
# # ns.natsorted(average_alus["Sample Name"])
# average_alus_df = pd.DataFrame(average_alus)
# type(average_alus_df)
# average_alus_df = average_alus_df.reset_index()
# average_alus_df["Sample Name"] = 
ns.natsorted(average_alus_df["Sample Name"])
# average_alus_df


# Prepares the data for the second sheet. Takes assay number, sample 
name, and ALUs, sorts them, and drops empty rows

df3 = df[["Assay Num","Sample Name", "Resp 1"]]
df3 = df3.sort_values(["Assay Num", "Sample Name"], ascending = 
[True, True])
# df3["Sample Name"] = ns.natsorted(df3["Sample Name"])
df3 = df3.dropna()


# Writes both sheets to an excel file
writer2 = pd.ExcelWriter(output_name)
average_alus.to_excel(writer2, sheet_name = "Averaged ALUs", index = 
"False")
df3.to_excel(writer2, sheet_name = "Raw Reps Sorted", index = 
"False")

writer2.save()

This was my first attempt, with my second attempts commented out. this is what I want the table to look like

Assay num Sample Name ALUs
175 CRM 1 1.507
CRM 2 2.8
~~~ ~
CRM 10 300

Vs how it actually looks like

Assay num Sample Name ALUs
175 CRM 1 1.507
CRM 10 300
CRM 2 2.8
CRM 3 9.7

And once I try sorting it after turning the series back into a dataframe, the groupby becomes undone

Assay num Sample Name ALUs
175 CRM 1 1.407
175 CRM 1 1.5
175 CRM 1 1.3
175 CRM 2 9.7

Solution

  • Here is a toy dataframe mimicking yours:

    import random
    
    import pandas as pd
    
    df = pd.DataFrame(
        {
            "Assay num": [176] * 10 + [175] * 10,
            "Sample Name": random.sample([f"CRM {i+1}" for i in range(10)], k=10) * 2,
            "ALUs": [random.randint(1, 2_000) for _ in range(10)] * 2,
        }
    )
    
    print(df)
    # Output
    
        Assay num Sample Name  ALUs
    0         176       CRM 3  1392
    1         176       CRM 2   869
    2         176       CRM 4  1732
    3         176       CRM 1  1385
    4         176       CRM 5    49
    5         176       CRM 8  1898
    6         176       CRM 6  1965
    7         176      CRM 10   626
    8         176       CRM 7   845
    9         176       CRM 9  1191
    10        175       CRM 3  1392
    11        175       CRM 2   869
    12        175       CRM 4  1732
    13        175       CRM 1  1385
    14        175       CRM 5    49
    15        175       CRM 8  1898
    16        175       CRM 6  1965
    17        175      CRM 10   626
    18        175       CRM 7   845
    19        175       CRM 9  1191
    

    Here is one way to sort string values as numeric ones in what ends up to be a multilevel index:

    new_df = pd.concat(
        [
            sub_df.sort_values("Sample Name", key=lambda col: col.str[4:].astype(int))
            .groupby(["Assay num", "Sample Name"], sort=False)
            .agg(sum)  # will not sum anything
            for _, sub_df in df.groupby(["Assay num"])
        ]
    )
    
    print(new_df)
    # Output
    
    Assay num Sample Name      
    175       CRM 1        1385
              CRM 2         869
              CRM 3        1392
              CRM 4        1732
              CRM 5          49
              CRM 6        1965
              CRM 7         845
              CRM 8        1898
              CRM 9        1191
              CRM 10        626
    176       CRM 1        1385
              CRM 2         869
              CRM 3        1392
              CRM 4        1732
              CRM 5          49
              CRM 6        1965
              CRM 7         845
              CRM 8        1898
              CRM 9        1191
              CRM 10        626