I know that this command below will summarize my table by adding the population by group and dividing it by the number of rows of each group.
dt[, .(pop=sum(pop_ct)/sum(.N)), , by=.(geoid)]
What I want to do, however, is to dived the summed population by the number of rows of another column in each group. something like this:
dt[, .(pop=sum(pop_ct)/ nrow(dt$geoid) ), , by=.(geoid)]
The point here is that the geoid
id6
and id7
are subregions of ct
E1010
, so the population of id6
and id7
should be an equal proportion of the population of the larger area E1010
in which they are contained.
Using the reproducible example bellow, this is the result I would like to get to:
> geoid pop
> 1: id1 47
> 2: id2 35
> 3: id3 10
> 4: id5 30
> 5: id4 10
> 6: id6 10
> 7: id7 10
dt <- data.table(
udh = LETTERS[c(1,1,1,1,2,2,3,3,3,4,5,5)],
ct = c('A000','A111','A222','A333','B444','B555','C666','C777','C888','D999','E1010','E1010'),
pop_udh = c(40,40,40,40,30,30,45,45,45,17,20,20),
pop_ct = c(20,10,8,2,25,5,5,30,10,17,20,20),
poor_prop_udh = c(10,10,10,10,5,5,8,8,8,7,9,9),
geoid = c('id1','id2','id2','id1','id1','id3','id3','id5','id4','id2','id6','id7'))
One way is to pre calculate the sizes of ct
and the divide accordingly, for example
dt[, N := .N, by = ct
][, .(pop = sum(pop_ct)/N[1L]), by = geoid]
# geoid pop
# 1: id1 47
# 2: id2 35
# 3: id3 10
# 4: id5 30
# 5: id4 10
# 6: id6 10
# 7: id7 10