Search code examples
rdataframeloopscategoriesgroup

Create category by group using a loop


In R, I want to create a "category" variable that depends on differences between the date of birth and the initial date of individuals born within a group (min) (which is in another dataframe), so that an individual will be in a category 1 if was born between the first 10 days within his category, in category 2 if he was born in the second 10 days, etc. and I want to generate categories until all the possible dates are filled.

I hope my example clarifies it. I made a loop that is running but has some disadvantages: 1) It requires to merge the two dataframes, and 2) It requires to specify one by one the categories.

group = c("A","B","C","D","E","F","G","H","I","J")
min = c(100,125,120,98,89,110,115,130,100,90)
max = c(140,185,220,200,145,150,145,170,170,140)
DATES = data.frame(group,min,max)
ind = c(1:20)
group = c("B","B","B","A","C","D","D","E","J","I","H","A","F","I","E","F","E","C","C","A")
birth_date = c(130,180,150,110,130,200,100,140,130,100,155,130,110,120,135,115,100,150,200,105)
BIRTH = data.frame(ind,group,birth_date)
BIRTH = merge(BIRTH, DATES, by = "group")
BIRTH$category = NA
for (i in 1:nrow(BIRTH)){
  if(isTRUE(BIRTH$birth_date[i] <= BIRTH$min[i] + 10)==TRUE){
    BIRTH$category[i] = 1
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*2)==TRUE) {
    BIRTH$category[i] = 2
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*2 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*3)==TRUE) {
    BIRTH$category[i] = 3
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*3 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*4)==TRUE) {
    BIRTH$category[i] = 4
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*4 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*5)==TRUE) {
    BIRTH$category[i] = 5
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*5 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*6)==TRUE) {
    BIRTH$category[i] = 6
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*6 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*7)==TRUE) {
    BIRTH$category[i] = 7
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*7 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*8)==TRUE) {
    BIRTH$category[i] = 8
  } else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*8 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*9)==TRUE) {
    BIRTH$category[i] = 9
    } else {
    BIRTH$category[i] = 10
  }
}

Any suggestions to make it more elegant/efficient?


Solution

  • Without the need to merge the data frames, match could be the way. We can write a function ctgrz_dts that categorizes dates, elaborating on @MrFlick's answer.

    ctgrz_dts <- \(x, tbl=DATES, g='group', bdc='birth_date') {
      a <- tbl[match(x[[g]], tbl[['group']]), -1]
      ## optional check if birthdate in date range
      valid <- sapply(seq_len(nrow(a)), \(i) 
                      x[[bdc]][i] >= a[i, ]$min & x[[bdc]][i] <= a[i, ]$max)
      ctg <- (x[[bdc]] - a$min) %/% 10 + 1
      ctg[!valid] <- NA  ## set NA if date not in range
      return(ctg)
    }
    

    Note, that I defined arguments for the group and birthdate columns, to make the function more flexible.

    Usage

    BIRTH$cat <- ctgrz_dts(BIRTH, DATES)
    head(BIRTH)
    #   ind group birth_date cat
    # 1   1     B        130   1
    # 2   2     B        180   6
    # 3   3     B        150   3
    # 4   4     A        110   2
    # 5   5     C        130   2
    # 6   6     D        200  11
    

    To check, we can merge and run @MrFlick's code now,

    BIRTH <- merge(BIRTH, DATES, by = "group")
    head(transform(BIRTH, category=(birth_date - min) %/% 10 + 1))
    #   group ind birth_date cat min max category
    # 1     A   4        110   2 100 140        2
    # 2     A  12        130   4 100 140        4
    # 3     A  20        105   1 100 140        1
    # 4     B   1        130   1 125 185        1
    # 5     B   2        180   6 125 185        6
    # 6     B   3        150   3 125 185        3
    

    to confirm that ctgrz_dts() gives the same categories without merging (note, that the order is different after merging).