python · pandas · group-by · pivot-table · summary

How to use pandas groupby, pivot_table, or crosstab to summarize this dataset


I have the following pandas DataFrame:


import pandas as pd

account_num = [
    1726905620833, 1727875510892, 1727925550921, 1727925575731, 1727345507414, 
    1713565531401, 1725735509119, 1727925546516, 1727925523656, 1727875509665, 
    1727875504742, 1727345504314, 1725475539855, 1791725523833, 1727925583805, 
    1727925544791, 1727925518810, 1727925606986, 1727925618602, 1727605517337, 
    1727605517354, 1727925583101, 1727925583201, 1727925583335, 1727025517810, 
    1727935718602]
total_due = [
    1662.87, 3233.73, 3992.05, 10469.28, 799.01, 2292.98, 297.07, 5699.06, 1309.82, 
    1109.67, 4830.57, 3170.12, 45329.73, 46.71, 11981.58, 3246.31, 3214.25, 2056.82, 
    1611.73, 5386.16, 2622.02, 5011.02, 6222.10, 16340.90, 1239.23, 1198.98]
net_returned = [
    0.0, 0.0, 0.0, 2762.64, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12008.27, 
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2762.69, 0.0, 0.0, 0.0, 9254.66, 0.0, 0.0]
total_fees = [
    0.0, 0.0, 0.0, 607.78, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2161.49, 0.0, 
    0.0, 0.0, 0.0, 0.0, 0.0, 536.51, 0.0, 0.0, 0.0, 1712.11, 0.0, 0.0]
flipped = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
proba = [
    0.960085, 0.022535, 0.013746, 0.025833, 0.076159, 0.788912, 0.052489, 0.035279, 
    0.019701, 0.552127, 0.063949, 0.061279, 0.024398, 0.902681, 0.009441, 0.015342, 
    0.006832, 0.032988, 0.031879, 0.026412, 0.025159, 0.023195, 0.022104, 0.021285, 
    0.026480, 0.025837]

d = { 
    "account_num" : account_num,
    "total_due" : total_due,
    "net_returned" : net_returned,
    "total_fees" : total_fees,
    "flipped" : flipped,
    "proba" : proba
}
df = pd.DataFrame(data=d)

Based on the probability score (proba), I assign each record a decile (Decile 1 through Decile 9, or Rejected). From there, I build a table that summarizes various financial statistics. Each decile is assigned a dollar threshold; if a record's total_due amount is greater than its decile's dollar threshold, the record is reviewed. Reviewed records are then eligible to be flipped, and flipped records may return money (net_returned) and earn fees (total_fees).

I tried adding the decile assignment and dollar threshold to the DataFrame, but I couldn't figure out how to build the summary table using pandas' groupby, pivot_table, or crosstab. So, instead, I built the following summary table with the code below:


decile_dict = {
    "Decile 1": [0.074397, 750], "Decile 2": [0.027125, 750], 
    "Decile 3": [0.026380, 750], "Decile 4": [0.025737, 1000], 
    "Decile 5": [0.025058, 2500], "Decile 6": [0.024250, 4000],
    "Decile 7": [0.023185, 5000], "Decile 8": [0.022004, 5000], 
    "Decile 9": [0.021185, 10000], "Rejected": [0.000000, 0]
}

index = []
proba_threshold = []
count = []
percent_all = []
dollar_threshold = []
total_reviews = []
accounts_flipped = []
total_due_reviewed = []
flip_rate = []
total_due_per_flip = []
total_due_per_review = []
dollars_recovered_flipped = []
total_recovered_per_flip = []
total_recovered_per_review = []
total_fees_recv = []
total_fees_recv_per_flip = []
total_fees_recv_per_review = []

df_predictions = df

for key, value in decile_dict.items():
    df_decile = df_predictions[df_predictions.proba > value[0]]
    df_predictions = df_predictions[df_predictions.proba <= value[0]]

    index.append(key)
    proba_threshold.append(str('{:,.5f}'.format(value[0])))
    count.append(f"{df_decile.shape[0]:,}")
    percent_all.append(f"{round(df_decile.shape[0] / df.shape[0] * 100.0, 1)}%")
    dollar_threshold.append(f"${value[1]}")

    df_reviewed = df_decile[df_decile.total_due > value[1]]
    total_reviews.append(f"{df_reviewed.shape[0]:,}")
    accounts_flipped.append(f"{df_reviewed.flipped.sum():,}")
    flip_rate.append(f"{df_reviewed.flipped.mean() * 100.0:,.1f}%")

    total_due_reviewed.append(f"${df_reviewed.total_due.sum():,.2f}")
    total_due_per_flip.append(
        f"${df_reviewed.total_due.sum() / df_reviewed.flipped.sum():,.2f}")
    total_due_per_review.append(f"${df_reviewed.total_due.mean():,.2f}")

    dollars_recovered_flipped.append(f"${df_reviewed.net_returned.sum():,.2f}")
    total_recovered_per_flip.append(
        f"${df_reviewed.net_returned.sum() / df_reviewed.flipped.sum():,.2f}")
    total_recovered_per_review.append(f"${df_reviewed.net_returned.mean():,.2f}")

    total_fees_recv.append(f"${df_reviewed.total_fees.sum():,.2f}")
    total_fees_recv_per_flip.append(
        f"${df_reviewed.total_fees.sum() / df_reviewed.flipped.sum():,.2f}")
    total_fees_recv_per_review.append(f"${df_reviewed.total_fees.mean():,.2f}")

data = {
    "Proba": proba_threshold,
    "Count": count,
    "% All": percent_all,
    "Threshold": dollar_threshold,
    "Reviewed": total_reviews,
    "Flipped": accounts_flipped,
    "FlipRate": flip_rate,
    "TotalDueRevd": total_due_reviewed,
    "DuePerFlip": total_due_per_flip,
    "DuePerReview": total_due_per_review,
    "TotalRecovered": dollars_recovered_flipped,
    "RecvdPerFlip":  total_recovered_per_flip,
    "RecvdPerReview":  total_recovered_per_review,
    "TotalFees": total_fees_recv,
    "FeesPerFlip": total_fees_recv_per_flip,
    "FeesPerReview": total_fees_recv_per_review,    
}
df_results = pd.DataFrame(data=data, index=index)

While I got the desired end result, I have to believe there is a much more elegant and brief solution in Python, especially considering I had to use lists to build my resulting DataFrame. I'm also afraid this will become a performance and/or memory issue when I attempt to run this solution on a dataset containing millions of records.

So, that's really my question to all you Python experts: is there a better way to solve this problem than the solution I came up with?

For completeness, this is the preprocessing I used to add the decile assignment and dollar threshold columns to the DataFrame:

def assign_decile(proba):
    decile = "Rejected"
    threshold = 0.0

    if proba > 0.074397:
        decile = "Decile 1"
        threshold = 750
    elif proba > 0.027125:
        decile = "Decile 2"
        threshold = 750
    elif proba > 0.026380:
        decile = "Decile 3"
        threshold = 750
    elif proba > 0.025737:
        decile = "Decile 4"
        threshold = 1000
    elif proba > 0.025058:
        decile = "Decile 5"
        threshold = 2500
    elif proba > 0.024250:
        decile = "Decile 6"
        threshold = 4000
    elif proba > 0.023185:
        decile = "Decile 7"
        threshold = 5000
    elif proba > 0.022004:
        decile = "Decile 8"
        threshold = 5000
    elif proba > 0.021185:
        decile = "Decile 9"
        threshold = 10000

    return decile, threshold

df_predictions = df.copy()  # copy, so the original df is left unmodified
df_predictions["decile"], df_predictions["threshold"] = zip(
    *df_predictions.proba.apply(assign_decile))
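
As an aside, the zip(*series.apply(...)) pattern above unpacks a Series of tuples into two parallel sequences. A minimal standalone illustration (with a toy tag function standing in for assign_decile):

```python
import pandas as pd

s = pd.Series([0.9, 0.01])

def tag(p):
    # toy stand-in for assign_decile: returns a (label, threshold) tuple
    return ("high", 750) if p > 0.5 else ("low", 0)

# zip(*...) transposes the sequence of tuples into two tuples
labels, thresholds = zip(*s.apply(tag))
print(labels)      # ('high', 'low')
print(thresholds)  # (750, 0)
```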

Solution

  • If you only want to generate a copy of your source DataFrame (df) with appended decile and threshold columns, you can do it using the pandasonic cut method.

    But this method requires bin borders in ascending order. So the first step is to extract bin borders from decile_dict, reverse their order and append 1.0 (the upper limit of the last bin):

    bord = list(map(lambda lst: lst[0], decile_dict.values()))
    bord.reverse()
    bord.append(1.0)
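
    Note that this relies on decile_dict preserving insertion order, which is guaranteed for dicts in Python 3.7+. A standalone sanity check, reproducing the question's decile_dict so the snippet runs on its own, confirms the edges come out ascending from 0.0 to 1.0:

```python
# Reproduce decile_dict from the question so the snippet runs standalone
decile_dict = {
    "Decile 1": [0.074397, 750], "Decile 2": [0.027125, 750],
    "Decile 3": [0.026380, 750], "Decile 4": [0.025737, 1000],
    "Decile 5": [0.025058, 2500], "Decile 6": [0.024250, 4000],
    "Decile 7": [0.023185, 5000], "Decile 8": [0.022004, 5000],
    "Decile 9": [0.021185, 10000], "Rejected": [0.000000, 0]
}

bord = [v[0] for v in decile_dict.values()]  # descending proba thresholds
bord.reverse()                               # now ascending
bord.append(1.0)                             # upper limit of the top bin

assert bord == sorted(bord)                  # edges are ascending
assert bord[0] == 0.0 and bord[-1] == 1.0
assert len(bord) == len(decile_dict) + 1     # 11 edges -> 10 bins
```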
    

    You should also pass bin labels, as they will be used as the values of the decile column. Their order should also be reversed:

    lab = list(decile_dict.keys())
    lab.reverse()
    

    Then, to generate the result, create a copy of the source DataFrame

    result = df.copy()
    

    and append 2 new columns:

    result['decile'] = pd.cut(df.proba, bins=bord, labels=lab)
    result['threshold'] = result.decile.map(lambda kk: decile_dict[kk][1])
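
    For intuition, here is how pd.cut maps probabilities to labels on a toy example (made-up thresholds, not the question's). The bins are right-closed by default, so a value exactly equal to a threshold falls into the lower bin, matching the "proba > threshold" logic in the question:

```python
import pandas as pd

proba = pd.Series([0.5, 0.025, 0.01])
bins = [0.0, 0.02, 0.03, 0.07, 1.0]                        # ascending edges
labels = ["Rejected", "Decile 3", "Decile 2", "Decile 1"]  # one label per bin
deciles = pd.cut(proba, bins=bins, labels=labels)
print(list(deciles))  # ['Decile 1', 'Decile 3', 'Rejected']
```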
    

    But if you also want to generate the second DataFrame (statistics per decile), then a more pandasonic approach is to create a function that computes these statistics for one group:

    def decStat(grp):
        dec = grp.decile.iloc[0]
        val = decile_dict[dec]
        pt = str('{:,.5f}'.format(val[0]))
        rCnt = grp.shape[0]
        cnt = f"{rCnt:,}"
        pa = f"{round(rCnt / df.shape[0] * 100.0, 1)}%"
        thr = f"${val[1]}"
        rv = grp[grp.total_due > val[1]]
        rvd = f"{rv.shape[0]:,}"
        flp = f"{rv.flipped.sum():,}"
        flpRate = f"{rv.flipped.mean() * 100.0:,.1f}%"
        tdr = f"${rv.total_due.sum():,.2f}"
        dpf = f"${rv.total_due.sum() / rv.flipped.sum():,.2f}"
        dpr = f"${rv.total_due.mean():,.2f}"
        tRec = f"${rv.net_returned.sum():,.2f}"
        rpf = f"${rv.net_returned.sum() / rv.flipped.sum():,.2f}"
        rpr = f"${rv.net_returned.mean():,.2f}"
        fees = f"${rv.total_fees.sum():,.2f}"
        fpf = f"${rv.total_fees.sum() / rv.flipped.sum():,.2f}"
        fpr = f"${rv.total_fees.mean():,.2f}"
        return pd.Series([pt, cnt, pa, thr, rvd, flp, flpRate, tdr, dpf, dpr, tRec,
                rpf, rpr, fees, fpf, fpr],
            index=['Proba', 'Count', '% All', 'Threshold', 'Reviewed', 'Flipped',
                'FlipRate', 'TotalDueRevd', 'DuePerFlip', 'DuePerReview', 'TotalRecovered',
                'RecvdPerFlip', 'RecvdPerReview', 'TotalFees', 'FeesPerFlip', 'FeesPerReview'])
    

    Then group the above result by decile and apply this function:

    resultByDec = result.groupby('decile').apply(decStat).sort_index()
    

    This approach should be substantially faster than yours, especially for larger amounts of source data.
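
    If the per-group Python-level work in decStat ever becomes a bottleneck on millions of rows, one further option (a sketch using hypothetical toy data, not the answer's code) is to compute the numeric aggregates with groupby/agg first and apply the string formatting only in a final pass over the small summary table:

```python
import pandas as pd

# Toy data with a precomputed decile column (hypothetical values)
df = pd.DataFrame({
    "decile":    ["Decile 1", "Decile 1", "Decile 2"],
    "total_due": [100.0, 200.0, 50.0],
    "flipped":   [1, 0, 1],
})

# Named aggregations run in optimized pandas internals, unlike a Python apply
stats = (df.groupby("decile", observed=True)
           .agg(Count=("total_due", "size"),
                Flipped=("flipped", "sum"),
                TotalDue=("total_due", "sum")))

# Formatting now touches only one row per decile, not one row per record
stats["TotalDue"] = stats["TotalDue"].map("${:,.2f}".format)
print(stats.loc["Decile 1", "TotalDue"])  # $300.00
```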