Search code examples
rdata.tablegroup-summaries

Summarize datatable by group based on nrows of other column


I know that this command below will summarize my table by adding the population by group and dividing it by the number of rows of each group.

dt[, .(pop=sum(pop_ct)/sum(.N)), , by=.(geoid)]

What I want to do, however, is to dived the summed population by the number of rows of another column in each group. something like this:

dt[, .(pop=sum(pop_ct)/ nrow(dt$geoid) ), , by=.(geoid)]

The point here is that the geoid id6 and id7 are subregions of ct E1010, so the population of id6 and id7 should be an equal proportion of the population of the larger area E1010 in which they are contained.

Expected result

Using the reproducible example bellow, this is the result I would like to get to:

>    geoid pop
> 1:   id1  47
> 2:   id2  35
> 3:   id3  10
> 4:   id5  30
> 5:   id4  10
> 6:   id6  10
> 7:   id7  10

Reproducible example

dt <- data.table(
  udh = LETTERS[c(1,1,1,1,2,2,3,3,3,4,5,5)],
  ct = c('A000','A111','A222','A333','B444','B555','C666','C777','C888','D999','E1010','E1010'),
  pop_udh = c(40,40,40,40,30,30,45,45,45,17,20,20),
  pop_ct = c(20,10,8,2,25,5,5,30,10,17,20,20),
  poor_prop_udh = c(10,10,10,10,5,5,8,8,8,7,9,9),
  geoid = c('id1','id2','id2','id1','id1','id3','id3','id5','id4','id2','id6','id7'))

Solution

  • One way is to pre calculate the sizes of ct and the divide accordingly, for example

    dt[, N := .N, by = ct
       ][, .(pop = sum(pop_ct)/N[1L]), by = geoid]
    #    geoid pop
    # 1:   id1  47
    # 2:   id2  35
    # 3:   id3  10
    # 4:   id5  30
    # 5:   id4  10
    # 6:   id6  10
    # 7:   id7  10