I am trying to clean a dataframe by deleting the wrongly added rows.
This is the dummy data :
temp <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002","25/06/2002", "26/06/2002",
"02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
"08/07/2002", "15/07/2002", "17/07/2002",
"22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"),
payment = c(200, 1000,-1000, -1000, 1000,
-1000,-1000,-1000, 1200,
-1200, 1200, 1200,
200, 56700, -56700, -200),
Code = c("ABC", "M567", "M567","M567", "XYZ", "M567", "ABX" ,
"M567","M567", "M567",
"M567", "M567", "M300",
"M678", "M678", "ABC"),
ID = c(NA, "98","187","187","12ee","M11","M13",
NA,"K999",
"K999", "111", "111", "11",
"12345", NA, NA)), row.names = c(NA, -16L), class = "data.frame")
The dataframe looks this this
Date payment Code ID
1 24/06/2002 200 ABC <NA>
2 24/06/2002 1000 M567 98
3 25/06/2002 -1000 M567 187
4 25/06/2002 -1000 M567 187
5 26/06/2002 1000 XYZ 12ee
6 02/07/2002 -1000 M567 M11
7 03/07/2002 -1000 ABX M13
8 24/07/2002 -1000 M567 <NA>
9 08/07/2002 1200 M567 K999
10 08/07/2002 -1200 M567 K999
11 15/07/2002 1200 M567 111
12 17/07/2002 1200 M567 111
13 22/07/2002 200 M300 11
14 22/07/2002 56700 M678 12345
15 28/07/2002 -56700 M678 <NA>
16 29/07/2002 -200 ABC <NA>
As you can see there are some positive and negative payments in the data. The negative payments are the wrongly added transactions or refunds.
For example +1200 will cancel out with -1200 based on Code and ID whereas row 14 and 15 are similar but the ID is NA - So I have to fill that with ID of its positive payment row and viseversa. So that I can delete those both rows.
Code I tried with the help of a programmer on StackOverflow (previously asked):
library(dplyr)
library(data.table)
library(tidyr)
Final_df <- df1 %>%
group_by(Code) %>%
mutate(ind = rowid(payment)) %>%
group_by(ind, .add = TRUE) %>%
fill(ID, .direction = 'downup') #%>%
ungroup %>%
mutate(absPayment = abs(payment)) %>%
arrange(ID, Code, absPayment) %>%
group_by(Code, ID, absPayment) %>%
mutate(grp = rowid(sign(payment))) %>%
group_by(grp, .add = TRUE) %>%
filter(n() == 1) %>%
ungroup %>%
select(names(df1))
But the problem here is row 8 - 24/07/2002 -1000 M567 should be filled by row 2 as the code and positive payment is matched - so that later I can cancel these both rows. Since the row is far away from row 8 .direction = 'downup' is not working.
And I think there is a better way to fill NA's other than using direction( As it is not getting applied with similar rows far away)
The Expected output is:
Date payment Code ID
1 25/06/2002 -1000 M567 187
2 25/06/2002 -1000 M567 187
3 26/06/2002 1000 XYZ 12ee
4 02/07/2002 -1000 M567 M11
5 03/07/2002 -1000 ABX M13
6 15/07/2002 1200 M567 111
7 17/07/2002 1200 M567 111
8 22/07/2002 200 M300 11
I am struck at this since 5 days. Any solutions would be really helpful.
Thanks in advance
Another possible Dummy data:
temp_2 <- structure(list(Date = c("22/06/2002", "23/06/2002","24/06/2002", "25/06/2002","25/06/2002", "26/06/2002",
"02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
"08/07/2002", "15/07/2002", "17/07/2002",
"22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"),
payment = c(200,-1000, 1000,-1000, -1000, 1000,
-1000,-1000,-1000, 1200,
-1200, 1200, 1200,
200, 56700, -56700, -200),
Code = c("ABC", "M567","M567", "M567","M567", "XYZ", "M567", "ABX" ,
"M567","M567", "M567",
"M567", "M567", "M300",
"M678", "M678", "ABC"),
ID = c(NA,"187", "98","187","187","12ee",NA,NA,
NA,"K999",
"K999", "111", "111", "11",
"12345", NA, NA)), row.names = c(NA, -17L), class = "data.frame")
Expected Output for temp_2:
Date payment Code ID
1 23/06/2002 -1000 M567 187
2 25/06/2002 -1000 M567 187
3 25/06/2002 -1000 M567 187
4 26/06/2002 1000 XYZ 12ee
5 03/07/2002 -1000 ABX <NA>
6 24/07/2002 -1000 M567 98
7 15/07/2002 1200 M567 111
8 17/07/2002 1200 M567 111
9 22/07/2002 200 M300 11
We can use
library(dplyr)
library(data.table)
f1 <- function(dat) {
i1 <- is.na(dat$ID) & nrow(dat) > 1
if(any(i1)) {
dat$ID[i1] <- dat$ID[!i1][match(dat$payment[i1],
-dat$payment[!i1])]
}
return(dat)
}
temp %>%
mutate(rn = row_number()) %>%
group_by(Code, absPayment = abs(payment)) %>%
filter(sum(payment) != 0) %>%
group_modify(~ f1(.x)) %>%
group_by(ID, .add = TRUE) %>%
mutate(grp = rowid(sign(payment))) %>%
group_by(grp, .add = TRUE) %>%
filter(n() == 1) %>%
ungroup %>%
arrange(rn) %>%
select(names(temp))
-output
# A tibble: 8 × 4
Date payment Code ID
<chr> <dbl> <chr> <chr>
1 25/06/2002 -1000 M567 187
2 25/06/2002 -1000 M567 187
3 26/06/2002 1000 XYZ 12ee
4 02/07/2002 -1000 M567 M11
5 03/07/2002 -1000 ABX M13
6 15/07/2002 1200 M567 111
7 17/07/2002 1200 M567 111
8 22/07/2002 200 M300 11
For the second case
temp_2 %>%
mutate(rn = row_number()) %>%
group_by(Code, absPayment = abs(payment)) %>%
filter(sum(payment) != 0) %>%
group_modify(~ f1(.x)) %>%
group_by(ID, .add = TRUE) %>%
mutate(grp = rowid(sign(payment))) %>%
group_by(grp, .add = TRUE) %>%
filter(n() == 1) %>%
ungroup %>%
arrange(rn) %>%
select(names(temp_2))
-output
# A tibble: 9 × 4
Date payment Code ID
<chr> <dbl> <chr> <chr>
1 23/06/2002 -1000 M567 187
2 25/06/2002 -1000 M567 187
3 25/06/2002 -1000 M567 187
4 26/06/2002 1000 XYZ 12ee
5 03/07/2002 -1000 ABX <NA>
6 24/07/2002 -1000 M567 98
7 15/07/2002 1200 M567 111
8 17/07/2002 1200 M567 111
9 22/07/2002 200 M300 11