r network-programming igraph social-networking network-analysis

How to filter the transaction data based on a cutoff value?

This is my transaction data:


data:

id          from_id        to_id      amount    date_trx
<fctr>      <fctr>         <fctr>     <dbl>     <date>
0           7468           5695       700.0     2005-01-04
1           6213           9379       11832.0   2005-01-08
2           7517           8170       1000.0    2005-01-10
3           6143           9845       4276.0    2005-01-12
4           6254           9640       200.0     2005-01-14
5           6669           5815       200.0     2005-01-20
6           6934           8583       49752.0   2005-01-24
7           9240           8314       19961.0   2005-01-26
8           6374           8865       1000.0    2005-01-30
9           6143           6530       13.4      2005-01-31
...

I want to filter data as follows:

If at any point in the transfer chain, the ratio of the money received by an account to the money leaving the initial sender is above a threshold value, say 0.9, ( that is 1 ≥ (second_transacted_amount / first_transacted_amount) > 0.9 ), then I want to extract these accounts and save them for later investigation.

For example, here the account "7468" sends 700.0 dollars to the account "5695" in 2005-01-04. After this transaction, let's say "5695" makes a transaction over 90 percent of 700.0 but not exactly 700.0. Here is another important point: The second transaction should always be ahead of the first. So, date_trx variable have importance in this case also. How can I do that for the whole data (in R)?

We can have a little dataset to test on:

id          from_id        to_id      amount    date_trx
<fctr>      <fctr>         <fctr>     <dbl>     <date>
0           A              B          200.0     2005-07-08
1           B              C          185.0     2005-08-20
2           B              E          50.0      2005-10-19
3           C              B          170.0     2005-05-08
4           C              D          40.0      2005-09-19  
5           D              F          38.0      2005-07-13

Considering the transactions from or to B,

0           A              B          200.0     2005-07-08
1           B              C          185.0     2005-08-20
2           B              E          50.0      2005-10-19
3           C              B          170.0     2005-05-08

due to the following transaction pair, B should be flagged as suspicious

0           A              B          200.0     2005-07-08
1           B              C          185.0     2005-08-20

With the same reasoning, considering the transactions from or to C

1           B              C          185.0     2005-08-20
3           C              B          170.0     2005-05-08
4           C              D          40.0      2005-09-19

C shouldn't be flagged as suspicious

Flagging can be done regarding from_id:

The output could be something similar to this:

id          from_id        to_id      amount    date_trx       suspicious
<fctr>      <fctr>         <fctr>     <dbl>     <date>         <fctr>
0           A              B          200.0     2005-07-08      N     
1           B              C          185.0     2005-08-20      Y
2           B              E          50.0      2005-10-19      Y
3           C              B          170.0     2005-05-08      N
4           C              D          40.0      2005-09-19      N   
5           D              F          38.0      2005-07-13      N

Solution

It can be done with SQL syntax using the sqldf package.

Here is how it works with your example

library(sqldf)

data <- data.frame(id = factor(c(0, 1, 2, 3, 4, 5)),
                   from_id = factor(c("A", "B", "B", "C", "C", "D")),
                   to_id = factor(c("B", "C", "E", "B", "D", "F")),
                   amount = c(200, 185, 50, 170, 40, 38),
                   date_trx = c(0708, 0820, 1019, 0508, 0919, 0713)))

Since all info is from one year, I treated the variable date_trx as month-day numeric, but it could be treated as year-month-day also.

The following query returns what is expected:

sqldf("select a.*, coalesce(b.suspicious, 'N') as suspicious
       from   data a
       left join (select distinct b.from_id, 'Y' as suspicious
                  from   data a
                  inner join data b
                             on a.to_id = b.from_id and 
                                a.date_trx < b.date_trx and 
                                b.amount/a.amount > 0.9) b
                  on a.from_id = b.from_id")

which returns in the R console

  id from_id to_id amount date_trx suspicious
1  0       A     B    200      708          N
2  1       B     C    185      820          Y
3  2       B     E     50     1019          Y
4  3       C     B    170      508          N
5  4       C     D     40      919          N
6  5       D     F     38      713          N

It seems that you are working to detect fraud, if it is the case, this approach does not works in real-time but you can look at figure 1 in this paper https://arxiv.org/pdf/2002.05988.pdf to have an idea of how to implement code for doing a real-time suspicious detection.