
Rewriting a for loop as an sapply call, taking NAs into account


I would like R to calculate the net income (netincome) for a given amount of Income:

panelID = c(1:50)   
year= c(2001:2010)
country = "NLD"
n <- 2
library(data.table)
set.seed(123)
DT <- data.table(panelID = rep(sample(panelID), each = n),
                 country = rep(sample(country, length(panelID), replace = T), each = n),
                 year = c(replicate(length(panelID), sample(year, n))),
                 some_NA = sample(0:5, 6),                                             
                 some_NA_factor = sample(0:5, 6),         
                 norm = round(runif(100)/10,2),
                 Income = round(rnorm(10,10,10),2),
                 Happiness = sample(10,10),
                 Sex = round(rnorm(10,0.75,0.3),2),
                 Age = sample(100,100),
                 Educ = round(rnorm(10,0.75,0.3),2))        
DT [, uniqueID := .I]                                                         # Creates a unique ID     
DT[DT == 0] <- NA 
DT$Income[DT$Income < 0] <- NA 
DT <- as.data.frame(DT)

Now, the tax needs to be calculated as follows:

For the first five years (2001-2005): Income below 20 is taxed at 25%, and the part of Income above 20 at 50%.

For the second five years (2006-2010): Income below 15 is taxed at 20%, and the part of Income above 15 at 45%.
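Read as marginal brackets (which is how the loop further down applies them, and assuming the threshold for 2006-2010 is 15, as in that code), the rules work out as in this small sketch. The incomes 12 and 30 are made-up illustration values, not taken from the data:

```r
# Hypothetical incomes, chosen only to illustrate the brackets.

# 2001-2005: 25% up to 20, 50% on the part above 20
net_2003_low  <- 12 - 12 * 0.25                       # 9
net_2003_high <- 30 - 20 * 0.25 - (30 - 20) * 0.50    # 20

# 2006-2010: 20% up to 15, 45% on the part above 15 (threshold assumed to be 15)
net_2008_low  <- 12 - 12 * 0.20                       # 9.6
net_2008_high <- 30 - 15 * 0.20 - (30 - 15) * 0.45    # 20.25
```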

I tried to write it as follows:

for (i in DT$Income) {
    if (DT$Income[i] < 20 & DT$year[i] < 2006) {
        DT$netincome[i] <- DT$Income[i] - (DT$Income[i]*0.25)
    } else if (DT$Income[i] > 20 & DT$year[i] < 2006) {
        DT$netincome[i] <- DT$Income[i] - (20*0.25) - ((DT$Income[i]-20)*0.5)
    } else if (DT$Income[i] < 15 & DT$year[i] > 2005) {
        DT$netincome[i] <- DT$Income[i] - (DT$Income[i]*0.20)
    } else if (DT$Income[i] > 15 & DT$year[i] > 2005) {
        DT$netincome[i] <- DT$Income[i] - (15*0.20) - ((DT$Income[i]-15)*0.45)
    }
}

But I get the error:

Error in `$<-.data.frame`(`*tmp*`, "netincome", value = c(NA, NA, NA,  : 
  replacement has 15 rows, data has 100
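The error most likely has two causes. First, `for (i in DT$Income)` loops over the income values themselves rather than over row positions, so `i` is a number such as an income (truncated to an integer when used as an index) or NA. Second, the column `netincome` does not exist yet, so `DT$netincome[i] <- ...` builds a new short vector (presumably of length 15 here) and then tries to store it in a 100-row data frame. A minimal sketch of a loop that avoids both problems and skips NA incomes, on a small made-up data frame `df` (not the data above):

```r
# Made-up example data, not the DT from the question
df <- data.frame(Income = c(12, 30, NA, 12, 30),
                 year   = c(2003, 2003, 2004, 2008, 2008))

df$netincome <- NA_real_            # create the full-length column first

for (i in seq_len(nrow(df))) {      # loop over row positions, not over values
  inc <- df$Income[i]
  if (is.na(inc)) next              # if (NA) would stop with an error
  # Income exactly at a threshold goes to the upper branch here;
  # both branches give the same value at that point.
  if (df$year[i] < 2006) {
    df$netincome[i] <- if (inc < 20) inc * 0.75 else inc - 20 * 0.25 - (inc - 20) * 0.50
  } else {
    df$netincome[i] <- if (inc < 15) inc * 0.80 else inc - 15 * 0.20 - (inc - 15) * 0.45
  }
}

df$netincome
```

Rows with a missing Income keep the NA that the column was initialised with.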

In addition, I would really like to rewrite this in a cleaner way with sapply but I am struggling with how.


Solution

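  • library(data.table)
    library(dplyr)                 # for case_when()
    setDT(DT)                      # DT was converted to a data.frame above; make it a data.table again
    DT[Income < 0, Income := NA]   # better use this construction
    # Add netincome by reference; rows with NA Income, or Income exactly on a threshold, stay NA
    DT[, netincome := case_when(Income < 20 & year < 2006 ~ Income - 0.25 * Income,
                                Income > 20 & year < 2006 ~ Income - 20 * 0.25 - 0.5 * (Income - 20),
                                Income < 15 & year > 2005 ~ Income - 0.2 * Income,
                                Income > 15 & year > 2005 ~ Income - 15 * 0.2 - 0.45 * (Income - 15))]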
    

    This would be much easier if you used consistent column names (lowercase is best practice, e.g. via tolower). Also try not to use names like DT: DT is the name of a widely used R package, so it is a bit confusing. data.table 1.13.0 and later also provide fcase, which is faster than case_when.
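As a rough sketch of the two alternatives mentioned here and in the question, the block below computes the same net income once with `fcase` and once with `sapply` over row positions. The table `d`, the helper `net_one` and the vector `net_sapply` are hypothetical names used only for illustration, and the data is made up. Unlike the strict `>` comparisons above, this sketch uses `>=` so that an Income exactly on a threshold also gets a value:

```r
library(data.table)

# Made-up example data, not the DT from the question
d <- data.table(Income = c(12, 30, NA, 12, 30),
                year   = c(2003, 2003, 2004, 2008, 2008))

# Vectorised: fcase() takes condition/value pairs in order;
# rows where no condition is TRUE (e.g. NA Income) get NA.
d[, netincome := fcase(
  Income <  20 & year < 2006, Income - 0.25 * Income,
  Income >= 20 & year < 2006, Income - 20 * 0.25 - 0.50 * (Income - 20),
  Income <  15 & year > 2005, Income - 0.20 * Income,
  Income >= 15 & year > 2005, Income - 15 * 0.20 - 0.45 * (Income - 15)
)]

# Per-row helper (hypothetical) for use with sapply(); NA is handled explicitly
net_one <- function(inc, yr) {
  if (is.na(inc)) return(NA_real_)
  if (yr < 2006) {
    if (inc < 20) inc * 0.75 else inc - 20 * 0.25 - (inc - 20) * 0.50
  } else {
    if (inc < 15) inc * 0.80 else inc - 15 * 0.20 - (inc - 15) * 0.45
  }
}

# sapply over row positions, so each call sees one Income and its year
net_sapply <- sapply(seq_len(nrow(d)),
                     function(i) net_one(d$Income[i], d$year[i]))
```

The `sapply` version is easier to read line by line, but the vectorised `fcase` (or `case_when`) form is usually the better fit for a whole column.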