Search code examples
rdataframemergeconditional-formattingfeature-engineering

creating new columns based on several conditions in R


I have a data frame consisting of three columns and the unique values for status are as follows "X" "0" "C" "1" "2" "3" "4" "5". In the beginning, I do not know how to group by each id and create several columns according to the conditions, for instance, a target column that is 1 if the status is 2, 3, 4, 5, and else is zero.

month_balance represents (The month of the extracted data is the starting point, backwards, 0 is the current month, -1 is the previous month, and so on)

status represents (0: 1-29 days past due, 1: 30-59 days past due, 2: 60-89 days overdue, 3: 90-119 days overdue, 4: 120-149 days overdue, 5: Overdue or bad debts write-offs for more than 150 days C: paid off that month, X: No loan for the month)

df <- data.frame (id  = c("5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805"),
                  month_balance = c("0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14","-15","0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14"),
                  status = c("C","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X")
                  )

In the end, I want to reach output as below:

df1 <- data.frame (id  = c("5008804","5008805"),
                  month_begin = c("16","15"),
                  paid_off = c("13","12"),
                  num_of_pastdues = c("2","2"),
                  no_loan = c("1","1"),
                  target = c("0","0"))

Solution

  • A base R solution can be to create a custom function and apply it on each group, i.e.

    MyFunction <- function(x){
      month_begin = length(x)
      paid_off = sum(x == 'C')
      num_of_pastdues = sum(x %in% 0:5)
      no_loan = sum(x == 'X')
      target = ifelse(any(x %in% 2:5), 1, 0)
      return(c(month_begin=month_begin, paid_off=paid_off, num_of_pastdues=num_of_pastdues, no_loan=no_loan, target=target))
    }
    
    res <- t(sapply(split(df$status, df$id), MyFunction))
    
                 month_begin paid_off num_of_pastdues no_loan target
    #    5008804 16          13       2               1       0     
    #    5008805 15          12       2               1       0
    

    To make it a data frame with the column id then,

    res_df <- data.frame(res)
    res_df$id <- rownames(res_df)
    rownames(res_df) <- NULL
    
    res_df
    
    #month_begin paid_off num_of_pastdues no_loan target      id
    #1          16       13               2       1      0 5008804
    #2          15       12               2       1      0 5008805