Search code examples
rdataframedata-cleaning

cleaning the data-frame with conditions


I am trying to clean a dataframe by deleting the wrongly added rows.

This is the dummy data :

temp <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002","25/06/2002", "26/06/2002", 
                               "02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
                               "08/07/2002", "15/07/2002", "17/07/2002", 
                               "22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"), 
                      payment = c(200, 1000,-1000, -1000, 1000,
                                  -1000,-1000,-1000, 1200,
                                  -1200, 1200, 1200,
                                  200, 56700, -56700, -200), 
                      Code = c("ABC", "M567", "M567","M567", "XYZ", "M567", "ABX" ,
                               "M567","M567", "M567", 
                               "M567", "M567", "M300", 
                               "M678", "M678", "ABC"), 
                      ID = c(NA, "98","187","187","12ee","M11","M13",
                             NA,"K999", 
                             "K999", "111", "111", "11",
                             "12345", NA, NA)), row.names = c(NA, -16L), class = "data.frame")

The dataframe looks this this

         Date payment Code    ID
1  24/06/2002     200  ABC  <NA>
2  24/06/2002    1000 M567    98
3  25/06/2002   -1000 M567   187
4  25/06/2002   -1000 M567   187
5  26/06/2002    1000  XYZ  12ee
6  02/07/2002   -1000 M567   M11
7  03/07/2002   -1000  ABX   M13
8  24/07/2002   -1000 M567  <NA>
9  08/07/2002    1200 M567  K999
10 08/07/2002   -1200 M567  K999
11 15/07/2002    1200 M567   111
12 17/07/2002    1200 M567   111
13 22/07/2002     200 M300    11
14 22/07/2002   56700 M678 12345
15 28/07/2002  -56700 M678  <NA>
16 29/07/2002    -200  ABC  <NA>

As you can see there are some positive and negative payments in the data. The negative payments are the wrongly added transactions or refunds.

For example +1200 will cancel out with -1200 based on Code and ID whereas row 14 and 15 are similar but the ID is NA - So I have to fill that with ID of its positive payment row and viseversa. So that I can delete those both rows.

Code I tried with the help of a programmer on StackOverflow (previously asked):

library(dplyr)
library(data.table)
library(tidyr)
Final_df <- df1 %>% 
  group_by(Code) %>%
  mutate(ind = rowid(payment)) %>%
  group_by(ind, .add = TRUE) %>% 
  fill(ID, .direction = 'downup') #%>% 
  ungroup %>%
  mutate(absPayment = abs(payment)) %>% 
  arrange(ID, Code, absPayment) %>%
  group_by(Code, ID, absPayment) %>%
  mutate(grp = rowid(sign(payment))) %>% 
  group_by(grp, .add = TRUE) %>%
  filter(n() == 1) %>% 
  ungroup %>%   
  select(names(df1)) 

But the problem here is row 8 - 24/07/2002 -1000 M567 should be filled by row 2 as the code and positive payment is matched - so that later I can cancel these both rows. Since the row is far away from row 8 .direction = 'downup' is not working.

And I think there is a better way to fill NA's other than using direction( As it is not getting applied with similar rows far away)

The Expected output is:

         Date payment Code    ID

1  25/06/2002   -1000 M567   187
2  25/06/2002   -1000 M567   187
3  26/06/2002    1000  XYZ  12ee
4  02/07/2002   -1000 M567   M11
5  03/07/2002   -1000  ABX   M13
6  15/07/2002    1200 M567   111
7  17/07/2002    1200 M567   111
8  22/07/2002     200 M300    11

I am struck at this since 5 days. Any solutions would be really helpful.

Thanks in advance

Another possible Dummy data:

temp_2 <-  structure(list(Date = c("22/06/2002", "23/06/2002","24/06/2002", "25/06/2002","25/06/2002", "26/06/2002", 
                               "02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
                               "08/07/2002", "15/07/2002", "17/07/2002", 
                               "22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"), 
                      payment = c(200,-1000, 1000,-1000, -1000, 1000,
                                  -1000,-1000,-1000, 1200,
                                  -1200, 1200, 1200,
                                  200, 56700, -56700, -200), 
                      Code = c("ABC", "M567","M567", "M567","M567", "XYZ", "M567", "ABX" ,
                               "M567","M567", "M567", 
                               "M567", "M567", "M300", 
                               "M678", "M678", "ABC"), 
                      ID = c(NA,"187", "98","187","187","12ee",NA,NA,
                             NA,"K999", 
                             "K999", "111", "111", "11",
                             "12345", NA, NA)), row.names = c(NA, -17L), class = "data.frame")

Expected Output for temp_2:

         Date payment Code    ID

1  23/06/2002   -1000 M567   187
2  25/06/2002   -1000 M567   187
3  25/06/2002   -1000 M567   187
4  26/06/2002    1000  XYZ  12ee
5  03/07/2002   -1000  ABX  <NA>
6  24/07/2002   -1000 M567   98
7 15/07/2002    1200 M567   111
8 17/07/2002    1200 M567   111
9 22/07/2002     200 M300    11

Solution

  • We can use

    library(dplyr)
    library(data.table)
    f1 <- function(dat) {
            i1 <- is.na(dat$ID) & nrow(dat) > 1
             if(any(i1)) {
                  dat$ID[i1] <- dat$ID[!i1][match(dat$payment[i1], 
                    -dat$payment[!i1])]
                   }
                return(dat)
                }
                
    temp %>%
       mutate(rn = row_number()) %>%
       group_by(Code, absPayment = abs(payment)) %>%
       filter(sum(payment) != 0) %>%
       group_modify(~ f1(.x)) %>%
       group_by(ID, .add = TRUE) %>%
       mutate(grp = rowid(sign(payment))) %>% 
       group_by(grp, .add = TRUE) %>%
       filter(n() == 1) %>% 
       ungroup %>%
        arrange(rn) %>%
       select(names(temp)) 
    

    -output

    # A tibble: 8 × 4
      Date       payment Code  ID   
      <chr>        <dbl> <chr> <chr>
    1 25/06/2002   -1000 M567  187  
    2 25/06/2002   -1000 M567  187  
    3 26/06/2002    1000 XYZ   12ee 
    4 02/07/2002   -1000 M567  M11  
    5 03/07/2002   -1000 ABX   M13  
    6 15/07/2002    1200 M567  111  
    7 17/07/2002    1200 M567  111  
    8 22/07/2002     200 M300  11   
    

    For the second case

     temp_2 %>%
       mutate(rn = row_number()) %>% 
       group_by(Code, absPayment = abs(payment)) %>%
       filter(sum(payment) != 0) %>%
       group_modify(~ f1(.x)) %>%
       group_by(ID, .add = TRUE) %>%
       mutate(grp = rowid(sign(payment))) %>% 
       group_by(grp, .add = TRUE) %>%
       filter(n() == 1) %>% 
       ungroup %>%
       arrange(rn) %>%
       select(names(temp_2))
    

    -output

    # A tibble: 9 × 4
      Date       payment Code  ID   
      <chr>        <dbl> <chr> <chr>
    1 23/06/2002   -1000 M567  187  
    2 25/06/2002   -1000 M567  187  
    3 25/06/2002   -1000 M567  187  
    4 26/06/2002    1000 XYZ   12ee 
    5 03/07/2002   -1000 ABX   <NA> 
    6 24/07/2002   -1000 M567  98   
    7 15/07/2002    1200 M567  111  
    8 17/07/2002    1200 M567  111  
    9 22/07/2002     200 M300  11