So I have been trying to sort a dataset by their assay numbers and their sample IDs, taking the mean of all the samples replicates for each assay number (Ex. A sample ran in one assay number could've been ran 1-10 times).
Doing so, I used the groupby method in pandas and was successful with the exception that sample names that were above 9 (ex. 10, 11, 20, 50) would be sorted near lower sample numbers (Ex.10 under 1, 20 under 2, 50 under 5) instead of where they should be in the column.
I understand it's a matter of the sorted_values method checking from left to right, but now I cannot even use the natsorted method to rectify it without completely undoing the groupby.
I've tried sorting the data beforehand to how I want it before doing the groupby, but that didn't work as it sorts it again after. I also tried sort = False but then the assay numbers are in the wrong place and I again cannot sort them properly.
import pandas as pd
import numpy as np
import natsort as ns
#Takes a sheet from an excel file and turns it into a data frame.
excel_name = "~~~~"
from_sheet_name = "OARS 2-8 "
output_name = "~~~ Data2.xlsx"
df = pd.read_excel(excel_name, from_sheet_name)
df.head(5)
#Prepares the averaged ALUs
df2= df.groupby(["Assay Num", "Sample Name"],sort = True)
average_alus = df2["Resp 1"].agg(np.mean)
# # average_alus["Sample Name"] = ns.natsorted(average_alus["Sample
Name"])
# # ns.natsorted(average_alus["Sample Name"])
# average_alus_df = pd.DataFrame(average_alus)
# type(average_alus_df)
# average_alus_df = average_alus_df.reset_index()
# average_alus_df["Sample Name"] =
ns.natsorted(average_alus_df["Sample Name"])
# average_alus_df
# Prepares the data for the second sheet. Takes assay number, sample
name, and ALUs, sorts them, and drops empty rows
df3 = df[["Assay Num","Sample Name", "Resp 1"]]
df3 = df3.sort_values(["Assay Num", "Sample Name"], ascending =
[True, True])
# df3["Sample Name"] = ns.natsorted(df3["Sample Name"])
df3 = df3.dropna()
# Writes both sheets to an excel file
writer2 = pd.ExcelWriter(output_name)
average_alus.to_excel(writer2, sheet_name = "Averaged ALUs", index =
"False")
df3.to_excel(writer2, sheet_name = "Raw Reps Sorted", index =
"False")
writer2.save()
This was my first attempt, with my second attempts commented out. this is what I want the table to look like
Assay num | Sample Name | ALUs |
---|---|---|
175 | CRM 1 | 1.507 |
CRM 2 | 2.8 | |
~~~ | ~ | |
CRM 10 | 300 |
Vs how it actually looks like
Assay num | Sample Name | ALUs |
---|---|---|
175 | CRM 1 | 1.507 |
CRM 10 | 300 | |
CRM 2 | 2.8 | |
CRM 3 | 9.7 |
And once I try sorting it after turning the series back into a dataframe, the groupby becomes undone
Assay num | Sample Name | ALUs |
---|---|---|
175 | CRM 1 | 1.407 |
175 | CRM 1 | 1.5 |
175 | CRM 1 | 1.3 |
175 | CRM 2 | 9.7 |
Here is a toy dataframe mimicking yours:
import random
import pandas as pd
df = pd.DataFrame(
{
"Assay num": [176] * 10 + [175] * 10,
"Sample Name": random.sample([f"CRM {i+1}" for i in range(10)], k=10) * 2,
"ALUs": [random.randint(1, 2_000) for _ in range(10)] * 2,
}
)
print(df)
# Output
Assay num Sample Name ALUs
0 176 CRM 3 1392
1 176 CRM 2 869
2 176 CRM 4 1732
3 176 CRM 1 1385
4 176 CRM 5 49
5 176 CRM 8 1898
6 176 CRM 6 1965
7 176 CRM 10 626
8 176 CRM 7 845
9 176 CRM 9 1191
10 175 CRM 3 1392
11 175 CRM 2 869
12 175 CRM 4 1732
13 175 CRM 1 1385
14 175 CRM 5 49
15 175 CRM 8 1898
16 175 CRM 6 1965
17 175 CRM 10 626
18 175 CRM 7 845
19 175 CRM 9 1191
Here is one way to sort string values as numeric ones in what ends up to be a multilevel index:
new_df = pd.concat(
[
sub_df.sort_values("Sample Name", key=lambda col: col.str[4:].astype(int))
.groupby(["Assay num", "Sample Name"], sort=False)
.agg(sum) # will not sum anything
for _, sub_df in df.groupby(["Assay num"])
]
)
print(new_df)
# Output
Assay num Sample Name
175 CRM 1 1385
CRM 2 869
CRM 3 1392
CRM 4 1732
CRM 5 49
CRM 6 1965
CRM 7 845
CRM 8 1898
CRM 9 1191
CRM 10 626
176 CRM 1 1385
CRM 2 869
CRM 3 1392
CRM 4 1732
CRM 5 49
CRM 6 1965
CRM 7 845
CRM 8 1898
CRM 9 1191
CRM 10 626