I have a data which look like this
ReportingMonth AmountPaid AccountId Country InitilBalance CurrentBalance
201501 10 112233 UK 100 90
201502 10 112233 UK 100 80
201503 20 112233 UK 100 60
201501 40 221144 US 200 160
201502 60 221144 US 200 100
201503 100 221144 US 200 0
Earlier I tried to convert AccountId as a factor so that model knows how a particular AccountId is performing however, that doesn't work and R is throwing out of memory error on just 0.1+ million data set. The goal is to predict next month (ReportingMonth) payment by a particular AccountId.
I want to start with a linear model and then move up if required, how to proceed with this kind of problem?
Edit1 (changed title to be more specific): - I have historical details of many account id's (repayment of a loan by month) and need to predict repayment of a loan by them in future. Started with linear regression to just see data in depth, however, I am currently looking for how to model and approach "loan repayment data"
I'm not sure what sort of regression models you are intending to run, as that would require some basic statistical understanding, knowing your data, and knowing what is it you are trying to do. Also, trying to predict with regression for something that is outside the observation range is ridiculously inaccurate. There are better means of doing that.
Nonetheless, here's an example of a linear regression and how one could go about predicting the data:
(1) converting the AccountId as a factor, changing the Reporting Month into a date and numerical date format:
library(dplyr)
df <- df %>%
mutate(id = as.factor(AccountId)) %>%
mutate(date = paste0(ReportingMonth, "01") %>% as.Date(., format="%Y%m%d")) %>%
mutate(day = as.numeric(date))
(2) Run a linear regression model AmountPaid ~ day + day:id + id
.
fit <- lm(AmountPaid ~ day*id, data=df)
summary(fit)
Call:
lm(formula = AmountPaid ~ day * id, data = df)
Residuals:
1 2 3 4 5 6
1.661 -3.500 1.839 3.643 -7.677 4.034
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.728e+03 2.884e+03 -0.946 0.4440
day 1.665e-01 1.751e-01 0.951 0.4422
id221144 -1.384e+04 4.078e+03 -3.394 0.0769 .
day:id221144 8.439e-01 2.477e-01 3.407 0.0764 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.31 on 2 degrees of freedom
Multiple R-squared: 0.9828, Adjusted R-squared: 0.9569
F-statistic: 38.01 on 3 and 2 DF, p-value: 0.02574
What the summary means is basically:
AmountPaid for id 112233 = 0.17*day - 2728 &
AmountPaid for id 221144 = (0.17+0.84)*day + (-2728 - 13840)
To predict the AmountPaid for each id monthly in 2015:
newdat <- expand.grid(date=seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by="month"),
id=levels(df$id))
newdat$day <- newdat$date %>% as.numeric
newdat[, c('fit', 'se')] <- predict(fit, newdata=newdat, se=T)[1:2]
Plotting out the fitted graph:
ggplot(newdat, aes(date, fit)) + geom_line() + geom_ribbon(aes(ymin=fit-1.96*se, ymax=fit+1.96*se), alpha=.2) +
geom_point(data=df, aes(date, AmountPaid)) +
ylab("AmountPaid") +
scale_x_date(date_break="1 month", date_labels="%Y-%m") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_wrap(~id)