Search code examples
pythonpandaspython-itertoolsfraud-prevention

How to create a dataframe with combinations of values within a column


Alright guys, this one's tough.

The subject of this ask is based on fraud data, rules and payments - i think it's easier to picture what i'm trying to do with a little background.

I have a dataframe that consists of accountids, amounts, dates, fraud etc. and one column called rule. There can be many unique rules that fired per accountid.

accountid    amount    date      rule   fraud
123          5         20191101  rule_1  fraud
123          10        20191102  rule_2  fraud
456          50        20191101  rule_1  nonfraud
456          50        20191101  rule_2  nonfraud
456          50        20191101  rule_3  nonfraud
456          50        20191101  rule_4  nonfraud

My goal is to create a function that iterates through each combination of 2 rules (maybe 3 later on), and have a couple other columns that count accountids, count number of fraud and nonfraud accountids, fraudrate etc. result looks like this:

rule_combo       count_acct     count_fraud   count_nonfraud    fraudrate
rule_1,rule_2    5              3             2                 .6
rule_2,rule_3    20             10            10                .5
rule_2,rule_4    50             10            40                .1
rule_1,rule_3    10             3             7                 .3

I have a function that looks like this that will do what I want but it uses columns (headers):

def combo_cols(df,agg,cols,n,minrate=0)
    combos = list(itertools.combinations(cols,3))
    num_cols = ['col1','col2','col3']
    res = []
    for combo in combos:
        fr = fraudrate(df,agg,combo) #using another function
        accounts = df.groupby(combo).apply(lambda df:list(df.accountid.unique()))
        accounts.name = 'accounts'
        accounts = accounts.to_frame()
        fr = accounts.join(fr)
        fr = fr.reset_index()
        fr['naccts'] = fr.apply(lamda df: len(set(df.accounts)),axis=1)
        fr.columns = num_cols + ['accounts','naccts','fraud','nonfraud','fraudrate','fpr']
        fr = fr.assign(groupcols = ', '.join(combo))
        fr = fr.loc[fr.fraudrate.gt(minrate)]
        res.append(fr)
    return pd.concat(res).sort_values(by='fraudrate',ascending=False)

I haven't been able to wrap my head around how to write a function that would do this for rules. Appreciate any help on this.


Solution

  • I'm not sure I am answering to your question, so please give me feedback and I will update if needed.

    My first bet would be to OneHotEncode those features you have. Here is an example:

    df = pd.get_dummies(df, columns=['rule', 'fraud'])

    Which result in this:

    accountid amount rule_rule_1 rule_rule_2 rule_rule_3 rule_rule_4 fraud_fraud fraud_nonfraud
    0   123     5        1           0           0           0           1           0
    1   123     10       0           1           0           0           1           0
    2   456     50       1           0           0           0           0           1
    3   456     50       0           1           0           0           0           1
    4   456     50       0           0           1           0           0           1
    5   456     50       0           0           0           1           0           1
    

    Then you could use itertools to make something like this:

    import itertools
    for elt in itertools.combinations(list(df.columns[df.columns.str.startswith('rule')]), 2):
        tmp = df.groupby(list(elt))
        # Apply your aggregation functions here
    

    Hope this helps!