Search code examples
rfor-loopif-statementnested-loops

Nested for loop to create a column, using two data sets in R


I want to create a new variable(named "treatment") in a dataset using two different datasets. My original datasets are two big datasets with other variables however, for simplicity, let's say I have the following datasets:

#individual level data, birth years
a <- data.frame (country_code = c(2,2,2,10,10,10,10,8), 
                               birth_year = c(1920,1930,1940,1970,1980,1990, 2000, 1910))
#country level reform info with affected cohorts
b <- data.frame(country_code = c(2,10,10,11),
                      lower_cutoff = c(1928, 1975, 1907, 1934),
                      upper_cutoff = c(1948, 1995, 1927, 1948),
                      cohort = c(1938, 1985, 1917, 1942))

Dataset a is an individual dataset with birth year informations and dataset b is country-level data with some country reform information. According to dataset b I want to create a treatment column in the dataset a. Treatment is 1 if the birth_year is between the cohort and upper_cutoff and 0 if between cohort and lower_cutoff. And anything else should be NA.

After creating an empty treatment column, I used the following code below:

for(i in 1:nrow(a)) {
  for(j in 1:nrow(b)){
    a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] & 
                                    a$birth_year[i] >= b$cohort[j] & 
                                    a$birth_year[i]<= b$upper_cutoff[j], "1",
                                  ifelse(a$ison[i] == b$ison[j] & 
                                           a$birth_year[i] < b$cohort[j] & 
                                           a$birth_year[i]>= b$lower_cutoff[j], "0", NA))
  }
}

As well as:

for(i in 1:nrow(a)) {
  for(j in 1:nrow(b)){
    a[i, "treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
                                          a[i,"birth_year"] >= b[j,"cohort"] &
                                          a[i,"birth_year"]<= b[j,"upper_cutoff"] ~ 1,
                                        a[i,"country_code"] == b[j, "country_code"] &
                                          a[i,"birth_year"] < b[j,"cohort"]&
                                          a[i,"birth_year"]>= b[j,"lower_cutoff"] ~ 0)
  }
}

Both codes run, but they only return NAs. The following is the result I want to get:

treatment <- c(NA, 0, 1, NA, 0, 1, NA, 0)

Any ideas about what is wrong? Or any other suggestions? Thanks in advance!


Solution

  • I found the mistake in my code. Apparently, I should have used a break to avoid overwriting the variable I'm creating. But, I'm still open to other answers.

    for(i in 1:nrow(a)) {
      for(j in 1:nrow(b)){
    
    if(!is.na(a$treatment[i])){break} #to make it stop if I already assign a value
    
    a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] & 
                                        a$birth_year[i] >= b$cohort[j] & 
                                        a$birth_year[i]<= b$upper_cutoff[j], "1",
                                      ifelse(a$ison[i] == b$ison[j] & 
                                               a$birth_year[i] < b$cohort[j] & 
                                               a$birth_year[i]>= b$lower_cutoff[j], "0", NA))
      }
    }