In R, I want to create a "category" variable that depends on differences between the date of birth and the initial date of individuals born within a group (min) (which is in another dataframe), so that an individual will be in a category 1 if was born between the first 10 days within his category, in category 2 if he was born in the second 10 days, etc. and I want to generate categories until all the possible dates are filled.
I hope my example clarifies it. I made a loop that is running but has some disadvantages: 1) It requires to merge the two dataframes, and 2) It requires to specify one by one the categories.
group = c("A","B","C","D","E","F","G","H","I","J")
min = c(100,125,120,98,89,110,115,130,100,90)
max = c(140,185,220,200,145,150,145,170,170,140)
DATES = data.frame(group,min,max)
ind = c(1:20)
group = c("B","B","B","A","C","D","D","E","J","I","H","A","F","I","E","F","E","C","C","A")
birth_date = c(130,180,150,110,130,200,100,140,130,100,155,130,110,120,135,115,100,150,200,105)
BIRTH = data.frame(ind,group,birth_date)
BIRTH = merge(BIRTH, DATES, by = "group")
BIRTH$category = NA
for (i in 1:nrow(BIRTH)){
if(isTRUE(BIRTH$birth_date[i] <= BIRTH$min[i] + 10)==TRUE){
BIRTH$category[i] = 1
} else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*2)==TRUE) {
BIRTH$category[i] = 2
} else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*2 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*3)==TRUE) {
BIRTH$category[i] = 3
} else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*3 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*4)==TRUE) {
BIRTH$category[i] = 4
} else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*4 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*5)==TRUE) {
BIRTH$category[i] = 5
} else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*5 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*6)==TRUE) {
BIRTH$category[i] = 6
} else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*6 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*7)==TRUE) {
BIRTH$category[i] = 7
} else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*7 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*8)==TRUE) {
BIRTH$category[i] = 8
} else if (isTRUE(BIRTH$birth_date[i] >= BIRTH$min[i] + 10*8 & BIRTH$birth_date[i] < BIRTH$min[i] + 10*9)==TRUE) {
BIRTH$category[i] = 9
} else {
BIRTH$category[i] = 10
}
}
Any suggestions to make it more elegant/efficient?
Without the need to merge
the data frames, match
could be the way. We can write a function ctgrz_dts
that categorizes dates, elaborating on @MrFlick's answer.
ctgrz_dts <- \(x, tbl=DATES, g='group', bdc='birth_date') {
a <- tbl[match(x[[g]], tbl[['group']]), -1]
## optional check if birthdate in date range
valid <- sapply(seq_len(nrow(a)), \(i)
x[[bdc]][i] >= a[i, ]$min & x[[bdc]][i] <= a[i, ]$max)
ctg <- (x[[bdc]] - a$min) %/% 10 + 1
ctg[!valid] <- NA ## set NA if date not in range
return(ctg)
}
Note, that I defined arguments for the group and birthdate columns, to make the function more flexible.
BIRTH$cat <- ctgrz_dts(BIRTH, DATES)
head(BIRTH)
# ind group birth_date cat
# 1 1 B 130 1
# 2 2 B 180 6
# 3 3 B 150 3
# 4 4 A 110 2
# 5 5 C 130 2
# 6 6 D 200 11
To check, we can merge
and run @MrFlick's code now,
BIRTH <- merge(BIRTH, DATES, by = "group")
head(transform(BIRTH, category=(birth_date - min) %/% 10 + 1))
# group ind birth_date cat min max category
# 1 A 4 110 2 100 140 2
# 2 A 12 130 4 100 140 4
# 3 A 20 105 1 100 140 1
# 4 B 1 130 1 125 185 1
# 5 B 2 180 6 125 185 6
# 6 B 3 150 3 125 185 3
to confirm that ctgrz_dts()
gives the same categories without merging (note, that the order is different after merging).