I am trying to summarize/aggregate a dataframe as follows. While the code gives the correct result, it is very repetitive and I would like to avoid that. I think that I need to use something like groupby, agg, apply, etc., but could not find a way to do that. The goal is to compute df_summ at the end. I think that I am using too many intermediate dataframes with selections of rows, and too many merges to put the results together. I feel that there must be a simpler way, but cannot figure it out.
The real df_stats input dataframe has millions of rows, and the df_summ output dataframe has dozens of columns. The input shown below is just a minimal reproducible example.
import io
import pandas as pd
TESTDATA="""
enzyme regions N length
AaaI all 10 238045
AaaI all 20 170393
AaaI all 30 131782
AaaI all 40 103790
AaaI all 50 81246
AaaI all 60 62469
AaaI all 70 46080
AaaI all 80 31340
AaaI all 90 17188
AaaI captured 10 292735
AaaI captured 20 229824
AaaI captured 30 193605
AaaI captured 40 163710
AaaI captured 50 138271
AaaI captured 60 116122
AaaI captured 70 95615
AaaI captured 80 73317
AaaI captured 90 50316
AagI all 10 88337
AagI all 20 19144
AagI all 30 11030
AagI all 40 8093
AagI all 50 6394
AagI all 60 4991
AagI all 70 3813
AagI all 80 2759
AagI all 90 1666
AagI captured 10 34463
AagI captured 20 19220
AagI captured 30 15389
AagI captured 40 12818
AagI captured 50 10923
AagI captured 60 9261
AagI captured 70 7753
AagI captured 80 6201
AagI captured 90 4495
"""
df_stats = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')
# One filtered copy per (regions, N) combination, keeping only enzyme and length
df_cap_N90 = df_stats[(df_stats['N'] == 90) & (df_stats['regions'] == 'captured')].drop(columns=['regions', 'N'])
df_cap_N50 = df_stats[(df_stats['N'] == 50) & (df_stats['regions'] == 'captured')].drop(columns=['regions', 'N'])
df_all_N50 = df_stats[(df_stats['N'] == 50) & (df_stats['regions'] == 'all')].drop(columns=['regions', 'N'])
df_summ_cap_N50_all_N50 = pd.merge(df_cap_N50, df_all_N50, on='enzyme', how='inner', suffixes=('_cap_N50', '_all_N50'))
df_summ_cap_N50_all_N50['cap_N50_all_N50'] = (df_summ_cap_N50_all_N50['length_cap_N50'] -
df_summ_cap_N50_all_N50['length_all_N50'])
print(df_summ_cap_N50_all_N50)
df_summ_cap_N90_all_N50 = pd.merge(df_cap_N90, df_all_N50, on='enzyme', how='inner', suffixes=('_cap_N90', '_all_N50'))
df_summ_cap_N90_all_N50['cap_N90_all_N50'] = df_summ_cap_N90_all_N50['length_cap_N90'] - df_summ_cap_N90_all_N50['length_all_N50']
print(df_summ_cap_N90_all_N50)
df_summ = pd.merge(df_summ_cap_N50_all_N50.drop(columns=['length_cap_N50', 'length_all_N50']),
df_summ_cap_N90_all_N50.drop(columns=['length_cap_N90', 'length_all_N50']),
on='enzyme', how='inner')
print(df_summ)
Prints:
enzyme length_cap_N50 length_all_N50 cap_N50_all_N50
0 AaaI 138271 81246 57025
1 AagI 10923 6394 4529
enzyme length_cap_N90 length_all_N50 cap_N90_all_N50
0 AaaI 50316 81246 -30930
1 AagI 4495 6394 -1899
enzyme cap_N50_all_N50 cap_N90_all_N50
0 AaaI 57025 -30930
1 AagI 4529 -1899
Notes on bioinformatics background behind this question:
(Feel free to skip this, it describes the domain knowledge behind the python code)
The above code is one step in a multi-step bioinformatics project, where I try to find optimal restriction enzymes based on the way they cut DNA.
As input for this step, I have a table with restriction enzymes (whose names are stored in column enzyme). I want to rank the enzymes based on the statistical properties of the way they cut DNA. Column regions stores two different DNA region types, which I want to differentiate using these enzymes. Column N is the name of the statistic that measures how finely the DNA is cut (N10, ..., N90), and length is the value of this statistic. The N statistics summarize the DNA fragment length distribution (measured in units of nucleotides), similar in spirit to quantiles (10%, ..., 90%). When I compare the enzymes, I want to do simple operations, such as cap_N90_all_N50 = { captured N90 } - { all N50 }, etc. Then I rank the enzymes by the combination of cap_N50_all_N50, etc.
You don't describe the logic, and I can't understand why you don't compute df_all_N90.
That being said, to get your expected output, you can try a pivot / sub:
# one row per enzyme, one column per (regions, N) pair
piv = (df_stats.loc[df_stats["N"].isin([50, 90])]
       .pivot(index="enzyme", columns=["regions", "N"], values="length"))
# subtract the (all, 50) column from every "captured" column at once
out = (piv["captured"].sub(piv[("all", 50)], axis=0)
       .add_prefix("cap_N").add_suffix("_all_N50").reset_index())
Output:
print(out)
N enzyme cap_N50_all_N50 cap_N90_all_N50
0 AaaI 57025 -30930
1 AagI 4529 -1899
[2 rows x 3 columns]
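Since your real output has dozens of columns, the same single pivot can feed several baselines at once; here is a minimal sketch under that assumption (the baselines dict, its entries, and the df_summ name are illustrative, not from your code):

```python
import io
import pandas as pd

TESTDATA = """\
enzyme regions N length
AaaI all 50 81246
AaaI all 90 17188
AaaI captured 50 138271
AaaI captured 90 50316
AagI all 50 6394
AagI all 90 1666
AagI captured 50 10923
AagI captured 90 4495
"""
df_stats = pd.read_csv(io.StringIO(TESTDATA), sep=r"\s+")

# Pivot once: one row per enzyme, one column per (regions, N) pair.
piv = df_stats.pivot(index="enzyme", columns=["regions", "N"], values="length")

# Each baseline column is subtracted from every "captured" column in one
# vectorized step; more baselines just mean more dict entries, no merges.
baselines = {"all_N50": ("all", 50), "all_N90": ("all", 90)}
parts = [piv["captured"].sub(piv[key], axis=0)
         .add_prefix("cap_N").add_suffix(f"_{name}")
         for name, key in baselines.items()]
df_summ = pd.concat(parts, axis=1).reset_index()
```

With the full N10...N90 input, extending baselines (or iterating over piv["all"].columns) yields all the difference columns without any intermediate dataframes or merges.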