Search code examples
rnetwork-programmingigraphsocial-networkingnetwork-analysis

How to filter the transaction data based on a cutoff value?


This is my transaction data:


data:

id          from_id        to_id      amount    date_trx
<fctr>      <fctr>         <fctr>     <dbl>     <date>
0           7468           5695       700.0     2005-01-04
1           6213           9379       11832.0   2005-01-08
2           7517           8170       1000.0    2005-01-10
3           6143           9845       4276.0    2005-01-12
4           6254           9640       200.0     2005-01-14
5           6669           5815       200.0     2005-01-20
6           6934           8583       49752.0   2005-01-24
7           9240           8314       19961.0   2005-01-26
8           6374           8865       1000.0    2005-01-30
9           6143           6530       13.4      2005-01-31
...

I want to filter data as follows:

If at any point in the transfer chain, the ratio of the money received by an account to the money leaving the initial sender is above a threshold value, say 0.9, ( that is 1 ≥ (second_transacted_amount / first_transacted_amount) > 0.9 ), then I want to extract these accounts and save them for later investigation.

For example, here the account "7468" sends 700.0 dollars to the account "5695" in 2005-01-04. After this transaction, let's say "5695" makes a transaction over 90 percent of 700.0 but not exactly 700.0. Here is another important point: The second transaction should always be ahead of the first. So, date_trx variable have importance in this case also. How can I do that for the whole data (in R)?

We can have a little dataset to test on:

id          from_id        to_id      amount    date_trx
<fctr>      <fctr>         <fctr>     <dbl>     <date>
0           A              B          200.0     2005-07-08
1           B              C          185.0     2005-08-20
2           B              E          50.0      2005-10-19
3           C              B          170.0     2005-05-08
4           C              D          40.0      2005-09-19  
5           D              F          38.0      2005-07-13

Considering the transactions from or to B,

0           A              B          200.0     2005-07-08
1           B              C          185.0     2005-08-20
2           B              E          50.0      2005-10-19
3           C              B          170.0     2005-05-08

due to the following transaction pair, B should be flagged as suspicious

0           A              B          200.0     2005-07-08
1           B              C          185.0     2005-08-20

With the same reasoning, considering the transactions from or to C

1           B              C          185.0     2005-08-20
3           C              B          170.0     2005-05-08
4           C              D          40.0      2005-09-19

C shouldn't be flagged as suspicious

Flagging can be done regarding from_id:

The output could be something similar to this:

id          from_id        to_id      amount    date_trx       suspicious
<fctr>      <fctr>         <fctr>     <dbl>     <date>         <fctr>
0           A              B          200.0     2005-07-08      N     
1           B              C          185.0     2005-08-20      Y
2           B              E          50.0      2005-10-19      Y
3           C              B          170.0     2005-05-08      N
4           C              D          40.0      2005-09-19      N   
5           D              F          38.0      2005-07-13      N

Solution

  • It can be done with SQL syntax using the sqldf package.

    Here is how it works with your example

    library(sqldf)
    
    data <- data.frame(id = factor(c(0, 1, 2, 3, 4, 5)),
                       from_id = factor(c("A", "B", "B", "C", "C", "D")),
                       to_id = factor(c("B", "C", "E", "B", "D", "F")),
                       amount = c(200, 185, 50, 170, 40, 38),
                       date_trx = c(0708, 0820, 1019, 0508, 0919, 0713)))
    

    Since all info is from one year, I treated the variable date_trx as month-day numeric, but it could be treated as year-month-day also.

    The following query returns what is expected:

    sqldf("select a.*, coalesce(b.suspicious, 'N') as suspicious
           from   data a
           left join (select distinct b.from_id, 'Y' as suspicious
                      from   data a
                      inner join data b
                                 on a.to_id = b.from_id and 
                                    a.date_trx < b.date_trx and 
                                    b.amount/a.amount > 0.9) b
                      on a.from_id = b.from_id")
    

    which returns in the R console

      id from_id to_id amount date_trx suspicious
    1  0       A     B    200      708          N
    2  1       B     C    185      820          Y
    3  2       B     E     50     1019          Y
    4  3       C     B    170      508          N
    5  4       C     D     40      919          N
    6  5       D     F     38      713          N
    

    It seems that you are working to detect fraud, if it is the case, this approach does not works in real-time but you can look at figure 1 in this paper https://arxiv.org/pdf/2002.05988.pdf to have an idea of how to implement code for doing a real-time suspicious detection.