I have a data frame, df2, that looks like this:
# A tibble: 38,161 x 5
   chromosome insRangeBegin cohort gender Cases
   <chr>              <dbl> <chr>  <chr>  <dbl>
 1 chr1              819957 WL-SA  F        173
 2 chr1              820179 WL-SA  F        173
 3 chr1             1610917 WL-PB  F        199
 4 chr1             1923485 WL-PB  F        199
 5 chr1             2098854 WL-SA  M        113
 6 chr1             4051411 WL-SA  F        173
 7 chr1             4099335 WL-SA  F        173
 8 chr1             4257094 WL-SA  F        173
 9 chr1             4346601 WL-SA  F        173
10 chr1             4348046 WL-SA  F        173
# … with 38,151 more rows
For each chromosome, I want to plot a histogram per cohort and gender in which the count in each bin is divided by the value in the "Cases" column for that cohort and gender.
Currently I generate the histogram with the following code:
df2 %>%
  filter(chromosome == "chr1") %>%
  ggplot() +
  geom_histogram(aes(x = insRangeBegin, fill = cohort), binwidth = 5e6, position = "stack") +
  facet_wrap(~gender, scales = "free") +
  xlim(c(0, 249250621))
But the counts (y axis) are not normalized to the number of cases (e.g., WL-SA F has more counts than WL-SA M partly because they come from 173 cases rather than 113). I would like the same graph, but with the counts in each bin divided by the matching number of cases: the WL-SA F counts divided by 173, the WL-SA M counts divided by 113, and so on. The desired result is a histogram of counts per case, with the number of cases taken from the "Cases" column.
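The solution was to use weights. Each row gets a weight of 1/Cases, which is mapped to the weight aesthetic of geom_histogram. Each bin's height is then the sum of its weights, i.e. the count in that bin divided by the number of cases for that cohort and gender.
df2$weights <- 1 / df2$Cases
df2 %>%
  filter(chromosome == "chr1") %>%
  ggplot(aes(x = insRangeBegin, weight = weights)) +
  geom_histogram(aes(fill = cohort), breaks = seq(0, 249250621, 5e6), position = "stack") +
  facet_wrap(~gender, scales = "free")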
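To see why the weights give counts per case, here is a minimal base R sketch on made-up data. The toy data frame, its values, and the two-bin range are assumptions for illustration only, not the real df2. It sums the 1/Cases weights per bin and gender, and checks that this equals the raw count in each bin divided by the number of cases.

```r
# Toy data with hypothetical values (not the real df2)
toy <- data.frame(
  insRangeBegin = c(1e6, 2e6, 3e6, 7e6, 1e6, 6e6),
  cohort        = "WL-SA",
  gender        = c("F", "F", "F", "F", "M", "M"),
  Cases         = c(173, 173, 173, 173, 113, 113)
)
toy$weights <- 1 / toy$Cases

# Two 5 Mb bins, assigned by hand with cut() instead of geom_histogram()
breaks <- seq(0, 1e7, 5e6)
toy$bin <- cut(toy$insRangeBegin, breaks = breaks, include.lowest = TRUE)

# Sum of weights per bin and gender: the bar heights of a weighted histogram
weighted <- xtabs(weights ~ bin + gender, data = toy)

# Raw counts per bin and gender, divided by the number of cases per gender
raw      <- xtabs(~ bin + gender, data = toy)
cases    <- c(F = 173, M = 113)
per_case <- sweep(raw, 2, cases[colnames(raw)], "/")

# Both routes give the same numbers
stopifnot(isTRUE(all.equal(as.vector(weighted), as.vector(per_case))))
print(weighted)
```

Because the weight is constant within each cohort and gender group, summing it over the rows of a bin is the same as dividing that bin's count by Cases, which is the normalization asked for in the question.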