This is my transaction data:
data:
id from_id to_id amount date_trx
<fctr> <fctr> <fctr> <dbl> <date>
0 7468 5695 700.0 2005-01-04
1 6213 9379 11832.0 2005-01-08
2 7517 8170 1000.0 2005-01-10
3 6143 9845 4276.0 2005-01-12
4 6254 9640 200.0 2005-01-14
5 6669 5815 200.0 2005-01-20
6 6934 8583 49752.0 2005-01-24
7 9240 8314 19961.0 2005-01-26
8 6374 8865 1000.0 2005-01-30
9 6143 6530 13.4 2005-01-31
...
I want to filter data as follows:
If at any point in the transfer chain, the ratio of the money received by an account to the money leaving the initial sender is above a threshold value, say 0.9, ( that is 1 ≥ (second_transacted_amount / first_transacted_amount) > 0.9 ), then I want to extract these accounts and save them for later investigation.
For example, here the account "7468" sends 700.0 dollars to the account "5695" in 2005-01-04. After this transaction, let's say "5695" makes a transaction over 90 percent of 700.0 but not exactly 700.0. Here is another important point: The second transaction should always be ahead of the first. So, date_trx variable have importance in this case also. How can I do that for the whole data (in R)?
We can have a little dataset to test on:
id from_id to_id amount date_trx
<fctr> <fctr> <fctr> <dbl> <date>
0 A B 200.0 2005-07-08
1 B C 185.0 2005-08-20
2 B E 50.0 2005-10-19
3 C B 170.0 2005-05-08
4 C D 40.0 2005-09-19
5 D F 38.0 2005-07-13
Considering the transactions from or to B,
0 A B 200.0 2005-07-08
1 B C 185.0 2005-08-20
2 B E 50.0 2005-10-19
3 C B 170.0 2005-05-08
due to the following transaction pair, B should be flagged as suspicious
0 A B 200.0 2005-07-08
1 B C 185.0 2005-08-20
With the same reasoning, considering the transactions from or to C
1 B C 185.0 2005-08-20
3 C B 170.0 2005-05-08
4 C D 40.0 2005-09-19
C shouldn't be flagged as suspicious
Flagging can be done regarding from_id:
The output could be something similar to this:
id from_id to_id amount date_trx suspicious
<fctr> <fctr> <fctr> <dbl> <date> <fctr>
0 A B 200.0 2005-07-08 N
1 B C 185.0 2005-08-20 Y
2 B E 50.0 2005-10-19 Y
3 C B 170.0 2005-05-08 N
4 C D 40.0 2005-09-19 N
5 D F 38.0 2005-07-13 N
It can be done with SQL syntax using the sqldf
package.
Here is how it works with your example
library(sqldf)
data <- data.frame(id = factor(c(0, 1, 2, 3, 4, 5)),
from_id = factor(c("A", "B", "B", "C", "C", "D")),
to_id = factor(c("B", "C", "E", "B", "D", "F")),
amount = c(200, 185, 50, 170, 40, 38),
date_trx = c(0708, 0820, 1019, 0508, 0919, 0713)))
Since all info is from one year, I treated the variable date_trx
as month-day numeric, but it could be treated as year-month-day also.
The following query returns what is expected:
sqldf("select a.*, coalesce(b.suspicious, 'N') as suspicious
from data a
left join (select distinct b.from_id, 'Y' as suspicious
from data a
inner join data b
on a.to_id = b.from_id and
a.date_trx < b.date_trx and
b.amount/a.amount > 0.9) b
on a.from_id = b.from_id")
which returns in the R console
id from_id to_id amount date_trx suspicious
1 0 A B 200 708 N
2 1 B C 185 820 Y
3 2 B E 50 1019 Y
4 3 C B 170 508 N
5 4 C D 40 919 N
6 5 D F 38 713 N
It seems that you are working to detect fraud, if it is the case, this approach does not works in real-time but you can look at figure 1 in this paper https://arxiv.org/pdf/2002.05988.pdf to have an idea of how to implement code for doing a real-time suspicious detection.