Search code examples
rmodelforecasting

Approach to monthly repayment data by customer's in R


I have a data which look like this

ReportingMonth  AmountPaid  AccountId  Country  InitilBalance CurrentBalance
201501           10          112233     UK       100           90
201502           10          112233     UK       100           80
201503           20          112233     UK       100           60  
201501           40          221144     US       200           160
201502           60          221144     US       200           100
201503           100         221144     US       200           0

Earlier I tried to convert AccountId as a factor so that model knows how a particular AccountId is performing however, that doesn't work and R is throwing out of memory error on just 0.1+ million data set. The goal is to predict next month (ReportingMonth) payment by a particular AccountId.

I want to start with a linear model and then move up if required, how to proceed with this kind of problem?

Edit1 (changed title to be more specific): - I have historical details of many account id's (repayment of a loan by month) and need to predict repayment of a loan by them in future. Started with linear regression to just see data in depth, however, I am currently looking for how to model and approach "loan repayment data"


Solution

  • I'm not sure what sort of regression models you are intending to run, as that would require some basic statistical understanding, knowing your data, and knowing what is it you are trying to do. Also, trying to predict with regression for something that is outside the observation range is ridiculously inaccurate. There are better means of doing that.

    Nonetheless, here's an example of a linear regression and how one could go about predicting the data:

    (1) converting the AccountId as a factor, changing the Reporting Month into a date and numerical date format:

    library(dplyr)
    df <- df %>% 
          mutate(id = as.factor(AccountId)) %>% 
          mutate(date = paste0(ReportingMonth, "01") %>% as.Date(., format="%Y%m%d")) %>% 
          mutate(day = as.numeric(date))
    

    (2) Run a linear regression model AmountPaid ~ day + day:id + id.

    fit <- lm(AmountPaid ~ day*id, data=df)
    summary(fit)
    
    Call:
    lm(formula = AmountPaid ~ day * id, data = df)
    
    Residuals:
         1      2      3      4      5      6 
     1.661 -3.500  1.839  3.643 -7.677  4.034 
    
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  -2.728e+03  2.884e+03  -0.946   0.4440  
    day           1.665e-01  1.751e-01   0.951   0.4422  
    id221144     -1.384e+04  4.078e+03  -3.394   0.0769 .
    day:id221144  8.439e-01  2.477e-01   3.407   0.0764 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 7.31 on 2 degrees of freedom
    Multiple R-squared:  0.9828,    Adjusted R-squared:  0.9569 
    F-statistic: 38.01 on 3 and 2 DF,  p-value: 0.02574
    

    What the summary means is basically:

    AmountPaid for id 112233 = 0.17*day - 2728 &
    AmountPaid for id 221144 = (0.17+0.84)*day + (-2728 - 13840)
    

    To predict the AmountPaid for each id monthly in 2015:

    newdat <- expand.grid(date=seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by="month"),
                      id=levels(df$id))
    
    
    newdat$day <- newdat$date %>% as.numeric
    newdat[, c('fit', 'se')] <- predict(fit, newdata=newdat, se=T)[1:2]
    

    Plotting out the fitted graph:

    ggplot(newdat, aes(date, fit)) + geom_line() + geom_ribbon(aes(ymin=fit-1.96*se, ymax=fit+1.96*se), alpha=.2) + 
      geom_point(data=df, aes(date, AmountPaid)) + 
      ylab("AmountPaid") +
      scale_x_date(date_break="1 month", date_labels="%Y-%m") + 
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      facet_wrap(~id)
    

    enter image description here