Tags: python, pandas, group-by, aggregate, vectorization

Aggregate dataframe using simpler (vectorized?) operations instead of loops


I have a piece of code that works correctly (gives the expected answer), but is both inefficient and unnecessarily complex. It uses loops that I want to simplify and make more efficient, possibly using vectorized operations. It also converts a dataframe to a series and then back into a dataframe again - another code chunk that needs work. In other words, I want to make this piece of code more pythonic.

I marked the problematic places in the code (below) with comments that start with # TODO:.

The goal of the code is to summarize and aggregate the input dataframe df (which holds the distributions of DNA fragment lengths for two types of regions: all and captured). This is a bioinformatics problem, part of a larger project that ranks enzymes by their ability to cut certain DNA regions into pieces of defined length. For the purpose of this question, the only relevant information is that length is an integer and there are two types of DNA regions: all and captured. The aim is to produce the dataframe df_pur with purity vs. length_cutoff (the length cutoff used when purifying the DNA). The steps are:

  • Compute the fraction of the total length, for each type of region, that is at or above each of the length_cutoffs.
  • Find the ratio of these fractions, captured / all, for each of the length_cutoffs and store the result in a dataframe.
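To make these two steps concrete, here is a minimal, self-contained sketch of the computation for a single cutoff (100) on the same six toy fragments used below (the names df_toy and frc exist only for this illustration):

import pandas as pd

# Toy fragments, identical to TESTDATA below
df_toy = pd.DataFrame({
    'length':  [1, 49, 200, 20, 480, 2000],
    'regions': ['all'] * 3 + ['captured'] * 3,
})

# Step 1: fraction of the total length at or above the cutoff, per region type
frc = (df_toy[df_toy['length'] >= 100].groupby('regions')['length'].sum()
       / df_toy.groupby('regions')['length'].sum())
# all: 200 / 250 = 0.800, captured: 2480 / 2500 = 0.992

# Step 2: purity = captured fraction / all fraction
print(frc['captured'] / frc['all'])   # 1.24
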
import io
import pandas as pd

# This is a minimal reproducible example. The real dataset has 2
# columns and 10s of millions of rows. Column 1 is integer, column 2
# has 2 values: 'all' and 'captured':
TESTDATA="""
length   regions
     1       all
    49       all
   200       all
    20  captured
   480  captured
  2000  captured
"""

df = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')

# This is a minimal reproducible example. The real list has ~10
# integer values (cutoffs):
length_cutoffs = [10, 100, 1000]

df_tot_length = pd.DataFrame(columns=['tot_length'])
df_tot_length['tot_length'] = df.groupby(['regions']).length.sum()
df_tot_length.reset_index(inplace=True)

print(df_tot_length)

#     regions  tot_length
# 0       all         250
# 1  captured        2500


df_frc_tot = pd.DataFrame(columns=['regions', 'length_cutoff', 'sum_lengths'])
regions = df['regions'].unique()
df_index = pd.DataFrame({'regions': regions}).set_index('regions')

# TODO: simplify this loop (vectorize?):
for length_cutoff in length_cutoffs:
    df_cur = (pd.DataFrame({'length_cutoff': length_cutoff,
                            'sum_lengths': df[df['length'] >= length_cutoff]
                            .groupby(['regions']).length.sum()},
                           # Prevent dropping rows where no elements
                           # are selected by the above
                           # condition. Re-insert the dropped rows,
                           # use for those sum_lengths = NaN
                        index=df_index.index)
              # Correct the above sum_lengths = NaN to 0:
              .fillna(0)).reset_index()
    # Undo the effect of `fillna(0)` above, which casts the
    # integer column as float:
    df_cur['sum_lengths'] = df_cur['sum_lengths'].astype('int')
    # TODO: simplify this loop (vectorize?):
    for region in regions:
        df_cur.loc[df_cur['regions'] == region, 'frc_tot_length'] = (
            df_cur.loc[df_cur['regions'] == region, 'sum_lengths'] /
            df_tot_length.loc[df_tot_length['regions'] == region, 'tot_length'])
    df_frc_tot = pd.concat([df_frc_tot, df_cur], axis=0)

df_frc_tot.reset_index(inplace=True, drop=True)

print(df_frc_tot)

#     regions length_cutoff sum_lengths  frc_tot_length
# 0       all            10         249           0.996
# 1  captured            10        2500           1.000
# 2       all           100         200           0.800
# 3  captured           100        2480           0.992
# 4       all          1000           0           0.000
# 5  captured          1000        2000           0.800

# TODO: simplify the next 2 statements:
ser_pur = (df_frc_tot.loc[df_frc_tot['regions'] == 'captured', 'frc_tot_length']
           .reset_index(drop=True) /
           df_frc_tot.loc[df_frc_tot['regions'] == 'all',      'frc_tot_length']
           .reset_index(drop=True))
df_pur = pd.DataFrame({'length_cutoff': length_cutoffs, 'purity': ser_pur})

print(df_pur)

#    length_cutoff    purity
# 0             10  1.004016
# 1            100  1.240000
# 2           1000       inf

Note:

I am primarily interested in making the code more clear, simple and pythonic.

Among the answers that are tied for the above, I will prefer the more efficient one in terms of speed.

I have 8 GB of memory available by default per job, but can increase this to 32 GB if needed. To benchmark efficiency, please use this real-life-sized example dataframe:

import numpy as np

num_rows = int(1e7)
df = pd.concat([
    pd.DataFrame({'length': np.random.choice(range(1, 201),   size=num_rows, replace=True), 'regions': 'all'}),
    pd.DataFrame({'length': np.random.choice(range(20, 2001), size=num_rows, replace=True), 'regions': 'captured'}),
]).reset_index(drop=True)
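
For reference, one simple way to time a candidate implementation on this dataframe (candidate here is just a hypothetical placeholder for whatever function is being benchmarked; df and length_cutoffs are as defined above):

import time

def candidate(df, length_cutoffs):
    # placeholder: drop in the implementation to be benchmarked
    ...

t0 = time.perf_counter()
candidate(df, length_cutoffs)
print(f"elapsed: {time.perf_counter() - t0:.2f} s")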

Solution

  • IIUC, you can do:

    import numpy as np

    length_cutoffs = [10, 100, 1000]
    
    df["bins"] = pd.cut(
        df["length"],
        pd.IntervalIndex.from_breaks([-np.inf] + length_cutoffs + [np.inf], closed="left"),
    )
    
    out = df.pivot_table(index=["regions", "bins"], values="length", aggfunc="sum")
    
    g = out.groupby(level=0)
    out["frc_tot_length"] = (
        g["length"].transform(lambda x: [x.iloc[i:].sum() for i in range(len(x))])
    ) / g["length"].sum()
    print(out)
    print()
    

    This prints:

                              length  frc_tot_length
    regions  bins                                   
    all      [-inf, 10.0)          1           1.000
             [10.0, 100.0)        49           0.996
             [100.0, 1000.0)     200           0.800
             [1000.0, inf)         0           0.000
    captured [-inf, 10.0)          0           1.000
             [10.0, 100.0)        20           1.000
             [100.0, 1000.0)     480           0.992
             [1000.0, inf)      2000           0.800
    
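    The lambda inside transform computes, for each region, a suffix sum of the per-bin totals, i.e. the total length falling in that bin or any higher bin. If the Python-level list comprehension becomes a bottleneck on the full-size data, an equivalent, fully vectorized variant (a sketch, not part of the original answer) is a reversed cumulative sum per region:

    # Suffix sum via a reversed per-region cumsum; equivalent to the lambda above
    rev_cumsum = out["length"][::-1].groupby(level=0).cumsum()[::-1]
    out["frc_tot_length"] = rev_cumsum / out.groupby(level=0)["length"].transform("sum")
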

    Then:

    x = out.unstack(level=0)
    x = x[("frc_tot_length", "captured")] / x[("frc_tot_length", "all")]
    print(x)
    

    Prints:

    bins
    [-inf, 10.0)       1.000000
    [10.0, 100.0)      1.004016
    [100.0, 1000.0)    1.240000
    [1000.0, inf)           inf
    dtype: float64
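
    If you need the exact df_pur layout from the question, a possible final step (not part of the original answer) is to drop the first bin, which corresponds to "no cutoff", and attach the cutoffs as a column:

    # The bins [10, 100), [100, 1000) and [1000, inf) correspond to the
    # cutoffs 10, 100 and 1000, so skip the leading [-inf, 10) bin:
    df_pur = pd.DataFrame({"length_cutoff": length_cutoffs,
                           "purity": x.iloc[1:].to_numpy()})
    print(df_pur)

    # This reproduces the df_pur from the question:
    #    length_cutoff    purity
    # 0             10  1.004016
    # 1            100  1.240000
    # 2           1000       inf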