I have a piece of code that works correctly (it gives the expected answer), but it is both inefficient and unnecessarily complex. It uses loops that I want to simplify and make more efficient, possibly using vectorized operations. It also converts a dataframe to a series and then back into a dataframe, which is another chunk of code that needs work. In other words, I want to make this piece of code more pythonic.
I marked the problematic places in the code (below) with comments that start with # TODO.
The goal of the code is to summarize and aggregate the input dataframe df, which has the distributions of DNA fragment lengths for two types of regions: all and captured. This is a bioinformatic problem, part of a larger project that ranks enzymes by their ability to cut certain DNA regions into pieces of defined length. For the purpose of this question, the only relevant information is that length is an integer and that there are two types of DNA regions: all and captured. The aim is to produce the dataframe df_pur with purity vs. length_cutoff (the cutoff on length when purifying the DNA). The steps are:
1. Compute the total length of each of the regions.
2. Compute the sum of the lengths for each of the regions that is above each of the length_cutoffs.
3. Divide the sums from step 2 by the totals from step 1 to get the fraction of the total length (frc_tot_length) above each of the length_cutoffs.
4. Compute the purity as the ratio of these fractions, captured / all, for each of the length_cutoffs, and store the result in a dataframe.
import io
import pandas as pd
# This is a minimal reproducible example. The real dataset has 2
# columns and 10s of millions of rows. Column 1 is integer, column 2
# has 2 values: 'all' and 'captured':
TESTDATA="""
length regions
1 all
49 all
200 all
20 captured
480 captured
2000 captured
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')
# This is a minimal reproducible example. The real list has ~10
# integer values (cutoffs):
length_cutoffs = [10, 100, 1000]
df_tot_length = pd.DataFrame(columns=['tot_length'])
df_tot_length['tot_length'] = df.groupby(['regions']).length.sum()
df_tot_length.reset_index(inplace=True)
print(df_tot_length)
# regions tot_length
# 0 all 250
# 1 captured 2500
df_frc_tot = pd.DataFrame(columns=['regions', 'length_cutoff', 'sum_lengths'])
regions = df['regions'].unique()
df_index = pd.DataFrame({'regions': regions}).set_index('regions')
# TODO: simplify this loop (vectorize?):
for length_cutoff in length_cutoffs:
    df_cur = (pd.DataFrame({'length_cutoff': length_cutoff,
                            'sum_lengths': df[df['length'] >= length_cutoff]
                            .groupby(['regions']).length.sum()},
                           # Prevent dropping rows where no elements
                           # are selected by the above
                           # condition. Re-insert the dropped rows,
                           # use for those sum_lengths = NaN
                           index=df_index.index)
              # Correct the above sum_lengths = NaN to 0:
              .fillna(0)).reset_index()
    # Undo the effect of `fillna(0)` above, which casts the
    # integer column as float:
    df_cur['sum_lengths'] = df_cur['sum_lengths'].astype('int')
    # TODO: simplify this loop (vectorize?):
    for region in regions:
        df_cur.loc[df_cur['regions'] == region, 'frc_tot_length'] = (
            df_cur.loc[df_cur['regions'] == region, 'sum_lengths'] /
            df_tot_length.loc[df_tot_length['regions'] == region, 'tot_length'])
    df_frc_tot = pd.concat([df_frc_tot, df_cur], axis=0)
df_frc_tot.reset_index(inplace=True, drop=True)
print(df_frc_tot)
# regions length_cutoff sum_lengths frc_tot_length
# 0 all 10 249 0.996
# 1 captured 10 2500 1.000
# 2 all 100 200 0.800
# 3 captured 100 2480 0.992
# 4 all 1000 0 0.000
# 5 captured 1000 2000 0.800
# TODO: simplify the next 2 statements:
ser_pur = (df_frc_tot.loc[df_frc_tot['regions'] == 'captured', 'frc_tot_length']
.reset_index(drop=True) /
df_frc_tot.loc[df_frc_tot['regions'] == 'all', 'frc_tot_length']
.reset_index(drop=True))
df_pur = pd.DataFrame({'length_cutoff': length_cutoffs, 'purity': ser_pur})
print(df_pur)
# length_cutoff purity
# 0 10 1.004016
# 1 100 1.240000
# 2 1000 inf
Note:
I am primarily interested in making the code clearer, simpler and more pythonic.
Among answers that are tied on the above, I will prefer the more efficient one in terms of speed.
I have 8 GB of memory available by default per job, but can increase this to 32 GB if needed. To benchmark efficiency, please use this real-life-sized example dataframe:
import numpy as np

num_rows = int(1e7)
df = pd.concat([
    pd.DataFrame({'length': np.random.choice(range(1, 201), size=num_rows, replace=True), 'regions': 'all'}),
    pd.DataFrame({'length': np.random.choice(range(20, 2001), size=num_rows, replace=True), 'regions': 'captured'}),
]).reset_index(drop=True)
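For timing, here is a minimal sketch of the kind of harness I have in mind for this dataframe (the candidate() function is only a placeholder of my own for whatever implementation is being compared, not code from above):
import timeit

def candidate(df, length_cutoffs):
    # placeholder: plug in the implementation being benchmarked and
    # return the df_pur dataframe (length_cutoff vs. purity)
    ...

# best of 3 single runs on the 2e7-row dataframe, to reduce noise
best = min(timeit.repeat(lambda: candidate(df, length_cutoffs),
                         number=1, repeat=3))
print(f"best of 3 runs: {best:.2f} s")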
IIUC, you can do:
import numpy as np

length_cutoffs = [10, 100, 1000]
df["bins"] = pd.cut(
    df["length"],
    pd.IntervalIndex.from_breaks([-np.inf] + length_cutoffs + [np.inf], closed="left"),
)
out = df.pivot_table(index=["regions", "bins"], values="length", aggfunc="sum")
g = out.groupby(level=0)
out["frc_tot_length"] = (
    g["length"].transform(lambda x: [x.iloc[i:].sum() for i in range(len(x))])
) / g["length"].sum()
print(out)
print()
This prints:
                          length  frc_tot_length
regions  bins
all      [-inf, 10.0)          1           1.000
         [10.0, 100.0)        49           0.996
         [100.0, 1000.0)     200           0.800
         [1000.0, inf)         0           0.000
captured [-inf, 10.0)          0           1.000
         [10.0, 100.0)        20           1.000
         [100.0, 1000.0)     480           0.992
         [1000.0, inf)      2000           0.800
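As a side note, the lambda inside transform above builds the reverse cumulative sum by repeated slicing, which is quadratic in the number of bins. With only a handful of cutoffs that cost is negligible, but the same column can be computed with a reversed cumsum; a sketch of an equivalent variant (my rewrite, not part of the answer's code):
# reverse cumulative sum of the per-bin sums within each region,
# then divide by each region's total length
out["frc_tot_length"] = (
    g["length"].transform(lambda x: x[::-1].cumsum()[::-1]) / g["length"].sum()
)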
Then:
x = out.unstack(level=0)
x = x[("frc_tot_length", "captured")] / x[("frc_tot_length", "all")]
print(x)
Prints:
bins
[-inf, 10.0)       1.000000
[10.0, 100.0)      1.004016
[100.0, 1000.0)    1.240000
[1000.0, inf)           inf
dtype: float64
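To get the exact df_pur layout the question asks for (one row per length_cutoff), the first bin, which corresponds to no cutoff at all, can be dropped and the remaining ratios paired with the cutoffs; a possible finishing step (my own addition, not part of the answer above):
# drop the [-inf, 10.0) row (it corresponds to "no cutoff") and attach the cutoffs
df_pur = pd.DataFrame({"length_cutoff": length_cutoffs,
                       "purity": x.iloc[1:].to_numpy()})
print(df_pur)
#    length_cutoff    purity
# 0             10  1.004016
# 1            100  1.240000
# 2           1000       inf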