
Divide histogram counts per group using ggplot


I have a dataframe as such:

df2:

# A tibble: 38,161 x 5
   chromosome insRangeBegin cohort gender Cases
   <chr>              <dbl> <chr>  <chr>  <dbl>
 1 chr1              819957 WL-SA  F        173
 2 chr1              820179 WL-SA  F        173
 3 chr1             1610917 WL-PB  F        199
 4 chr1             1923485 WL-PB  F        199
 5 chr1             2098854 WL-SA  M        113
 6 chr1             4051411 WL-SA  F        173
 7 chr1             4099335 WL-SA  F        173
 8 chr1             4257094 WL-SA  F        173
 9 chr1             4346601 WL-SA  F        173
10 chr1             4348046 WL-SA  F        173
# … with 38,151 more rows

For each chromosome, I want to plot a histogram per cohort and gender, with the counts divided by the number in the "Cases" column for that cohort and gender.

Currently I generate the histogram with the following code:

df2 %>% filter(chromosome == "chr1") %>% ggplot(.) + geom_histogram(aes(x=insRangeBegin, fill=cohort), binwidth=5e6, position="stack") + facet_wrap(~gender, scales="free") + xlim(c(0, 249250621))

And I get: [image: stacked histogram of insRangeBegin, filled by cohort, faceted by gender]

But the counts (y axis) are not normalized by the number of Cases (e.g., I have more counts in WL-SA F than in WL-SA M because they come from 173 cases compared to 113 cases). I would like to get the same graph, but with the counts in each bin divided by the number of cases: the counts for WL-SA F divided by 173, the counts for WL-SA M divided by 113, etc. The desired result is a histogram of counts per case, with the number of cases as given in the "Cases" column.


Solution

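  • The solution was to use weights: give each row a weight of 1/Cases, so that the height of each bin becomes the number of rows in that bin divided by the number of cases for that cohort and gender.

    > df2$weights <- 1/df2$Cases
    > df2 %>% filter(chromosome == "chr1") %>% ggplot(., aes(x=insRangeBegin, weight=weights)) + geom_histogram(aes(fill=cohort), breaks = seq(0, 249250621, 5e6), position="stack") + facet_wrap(~gender, scales="free")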
    

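To see what the weighting does to the bin heights, here is a minimal sketch on a tiny made-up data frame. The `toy` data, its cohort labels "A" and "B", and the `w` column are invented for illustration and are not the asker's `df2`. It assumes ggplot2 is installed and uses `layer_data()` to read back the bin heights ggplot computed.

```r
library(ggplot2)

# Toy data, invented for illustration only (not the real df2).
# Cohort A has 4 cases and 4 rows; cohort B has 2 cases and 2 rows.
toy <- data.frame(
  insRangeBegin = c(1e6, 2e6, 3e6, 7e6, 1e6, 6e6),
  cohort        = c("A", "A", "A", "A", "B", "B"),
  Cases         = c(4, 4, 4, 4, 2, 2)
)

# Each row contributes 1 / Cases instead of 1, so a bin's height is
# (number of rows in the bin) / Cases for that group.
toy$w <- 1 / toy$Cases

p <- ggplot(toy, aes(x = insRangeBegin, weight = w, fill = cohort)) +
  geom_histogram(breaks = seq(0, 1e7, 5e6), position = "stack")

# layer_data() returns the computed bins; `count` holds the weighted sums.
# Cohort A: 3/4 in the first bin and 1/4 in the second.
# Cohort B: 1/2 in each bin.
bins <- layer_data(p)
print(bins[, c("group", "xmin", "xmax", "count")])
```

Because the weights are constant within each cohort and gender, this is the same as dividing each group's raw bin counts by its number of cases, which is the "counts per case" the question asks for.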