python, pandas, group-by, merge, aggregate

Aggregate dataframe using less repetitive code than a bunch of select and merge operations


I am trying to summarize/aggregate a dataframe as follows. The code gives the correct result, but it is very repetitive and I would like to avoid that. I suspect I need something like groupby, agg, or apply, but could not find a way to make it work. The goal is to compute df_summ at the end. I think I am using too many intermediate dataframes holding row selections, and too many merges to put the results back together. There must be a simpler way, but I cannot figure it out.

The real df_stats input dataframe has millions of rows, and the df_summ output dataframe has dozens of columns. The input shown below is just a minimal reproducible example.

import io
import pandas as pd

TESTDATA="""
enzyme  regions   N   length
AaaI    all       10  238045
AaaI    all       20  170393
AaaI    all       30  131782
AaaI    all       40  103790
AaaI    all       50  81246
AaaI    all       60  62469
AaaI    all       70  46080
AaaI    all       80  31340
AaaI    all       90  17188
AaaI    captured  10  292735
AaaI    captured  20  229824
AaaI    captured  30  193605
AaaI    captured  40  163710
AaaI    captured  50  138271
AaaI    captured  60  116122
AaaI    captured  70  95615
AaaI    captured  80  73317
AaaI    captured  90  50316
AagI    all       10  88337
AagI    all       20  19144
AagI    all       30  11030
AagI    all       40  8093
AagI    all       50  6394
AagI    all       60  4991
AagI    all       70  3813
AagI    all       80  2759
AagI    all       90  1666
AagI    captured  10  34463
AagI    captured  20  19220
AagI    captured  30  15389
AagI    captured  40  12818
AagI    captured  50  10923
AagI    captured  60  9261
AagI    captured  70  7753
AagI    captured  80  6201
AagI    captured  90  4495
"""


df_stats = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')

df_cap_N90 = df_stats[(df_stats['N'] == 90) & (df_stats['regions'] == 'captured')].drop(columns=['regions', 'N'])
df_cap_N50 = df_stats[(df_stats['N'] == 50) & (df_stats['regions'] == 'captured')].drop(columns=['regions', 'N'])

df_all_N50 = df_stats[(df_stats['N'] == 50) & (df_stats['regions'] == 'all')     ].drop(columns=['regions', 'N'])

df_summ_cap_N50_all_N50 = pd.merge(df_cap_N50, df_all_N50, on='enzyme', how='inner', suffixes=('_cap_N50', '_all_N50'))
df_summ_cap_N50_all_N50['cap_N50_all_N50'] = (df_summ_cap_N50_all_N50['length_cap_N50'] -
                                              df_summ_cap_N50_all_N50['length_all_N50'])
print(df_summ_cap_N50_all_N50)

df_summ_cap_N90_all_N50 = pd.merge(df_cap_N90, df_all_N50, on='enzyme', how='inner', suffixes=('_cap_N90', '_all_N50'))
df_summ_cap_N90_all_N50['cap_N90_all_N50'] = df_summ_cap_N90_all_N50['length_cap_N90'] - df_summ_cap_N90_all_N50['length_all_N50']
print(df_summ_cap_N90_all_N50)


df_summ = pd.merge(df_summ_cap_N50_all_N50.drop(columns=['length_cap_N50', 'length_all_N50']),
                   df_summ_cap_N90_all_N50.drop(columns=['length_cap_N90', 'length_all_N50']),
                   on='enzyme', how='inner')
print(df_summ)

Prints:

  enzyme  length_cap_N50  length_all_N50  cap_N50_all_N50
0   AaaI          138271           81246            57025
1   AagI           10923            6394             4529
  enzyme  length_cap_N90  length_all_N50  cap_N90_all_N50
0   AaaI           50316           81246           -30930
1   AagI            4495            6394            -1899
  enzyme  cap_N50_all_N50  cap_N90_all_N50
0   AaaI            57025           -30930
1   AagI             4529            -1899

Notes on bioinformatics background behind this question:

(Feel free to skip this, it describes the domain knowledge behind the python code)

The above code is one step in a multi-step bioinformatics project, where I try to find optimal restriction enzymes based on the way they cut DNA.

As input for this step, I have a table of restriction enzymes (whose names are stored in column enzyme). I want to rank the enzymes based on the statistical properties of the way they cut DNA. Column regions stores two different DNA region types, which I want to differentiate using these enzymes. Column N names the statistic that measures how finely the DNA is cut (N10, ..., N90), and length is the value of that statistic. The N statistics summarize the DNA fragment length distribution (measured in nucleotides), similar in spirit to quantiles (10%, ..., 90%). When I compare the enzymes, I want to do simple operations, such as cap_N90_all_N50 = { captured N90 } - { all N50 }, etc. Then I rank the enzymes by the combination of cap_N50_all_N50, etc.


Solution

  • You don't describe the logic, so I can't tell why you don't also compute df_all_N90.

    That being said, to get your expected output, you can try a pivot / sub:

    piv = (df_stats.loc[df_stats["N"].isin([50, 90])]
           .pivot(index="enzyme", columns=["regions", "N"], values="length"))
    
    out = (piv["captured"].sub(piv[("all", 50)], axis=0)
               .add_prefix("cap_N").add_suffix("_all_N50").reset_index())
    

    Output:

    print(out)
    
    N enzyme  cap_N50_all_N50  cap_N90_all_N50
    0   AaaI            57025           -30930
    1   AagI             4529            -1899
    
    [2 rows x 3 columns]
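
    If you later need more than these two differences, the same pivot idea generalizes: build one wide table keyed by enzyme, then compute each output column from a pair of (regions, N) keys. Below is a minimal sketch; the `pairs` dict and its column names are illustrative (not from the original post), and the inline dataframe is just the N=50/N=90 subset of the question's test data:

    ```python
    import pandas as pd

    # Subset of the question's df_stats: only the rows that feed df_summ
    df_stats = pd.DataFrame({
        "enzyme":  ["AaaI", "AaaI", "AaaI", "AagI", "AagI", "AagI"],
        "regions": ["all", "captured", "captured", "all", "captured", "captured"],
        "N":       [50, 50, 90, 50, 50, 90],
        "length":  [81246, 138271, 50316, 6394, 10923, 4495],
    })

    # One wide table per enzyme, with (regions, N) MultiIndex columns
    piv = df_stats.pivot(index="enzyme", columns=["regions", "N"], values="length")

    # Hypothetical spec: output column -> ((minuend region, N), (subtrahend region, N))
    pairs = {
        "cap_N50_all_N50": (("captured", 50), ("all", 50)),
        "cap_N90_all_N50": (("captured", 90), ("all", 50)),
    }

    # Each difference is a Series subtraction aligned on the enzyme index
    df_summ = pd.DataFrame(
        {name: piv[a] - piv[b] for name, (a, b) in pairs.items()}
    ).reset_index()
    print(df_summ)
    ```

    Adding another comparison is then a one-line change to `pairs` instead of another select/merge block.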