Search code examples
rdplyrsummarize

Summarise proportion of grouped variable in r


I am trying to summarize a grouped variable, but I want to get the sum as the proportion of the number in each group. Here is an example:

Create the following dataframe:

dat <- data.frame(year = c(rep(1970, 4), rep(1980, 4), rep(1990, 4)), 
                  country = rep(1:4, 3), 
                  change = c(0.1, 0.1, 0.1, -0.1, NA, 0.1, 0.1, 0.1, NA, NA, 0.1, 0.1))
> dat
   year country change
1  1970       1    0.1
2  1970       2    0.1
3  1970       3    0.1
4  1970       4   -0.1
5  1980       1     NA
6  1980       2    0.1
7  1980       3    0.1
8  1980       4    0.1
9  1990       1     NA
10 1990       2     NA
11 1990       3    0.1
12 1990       4    0.1

I can generate a count of the number of countries experiencing a positive change:

dat %>%
  mutate(change.true = ifelse(change > 0, 1, 0)) %>%
  group_by(year) %>%
  summarise(count.positive = sum(change.true))

# A tibble: 3 x 2
   year count.positive
  <dbl>          <dbl>
1  1970              3
2  1980              3
3  1990              2

This suggests that the number of countries seeing a positive change is stable or decreasing. However this is an effect of there being a decreasing number of countries for which there is data (I undertstand that i need to treat missing data with caution). What I would like to explore is how the proportion of countries for which there is data is changing. Essentially i want this:

data.frame(year = c(1970, 1980, 1990), 
           prop.positive = c(0.75, 1, 1))
  year prop.positive
1 1970          0.75
2 1980          1.00
3 1990          1.00

How do I do I amend the summarize function so that I get a count of the group size for each group? I'd be happy to mutate another column with the group size and then calculate the proportion, but don't know how to get that group size. It goes without saying that I am trying to do this for a massive data set.


Solution

  • We can get the proportion by making use of na.rm from the mean i.e. get the mean of logical expression (change > 0) while removing the `NA

    library(dplyr)
    dat %>% 
        group_by(year) %>% 
        summarise(prop.positive = mean(change > 0, na.rm = TRUE))
    # A tibble: 3 x 2
    #   year prop.positive
    #  <dbl>         <dbl>
    #1  1970          0.75
    #2  1980          1   
    #3  1990          1   
    

    If some of the 'country' are missing, then we can do a complete on the dataset to get the missing values and then do the group_by summarise

    library(tidyr)
    dat %>%
      complete(year, country) %>%
      group_by(year) %>%
      summarise(prop.positive = mean(change > 0, na.rm = TRUE))
    

    In base R, we can do the same with aggregate

    aggregate(cbind(prop.positive = change > 0) ~ year, dat,
               mean, na.rm = TRUE, na.action = na.pass)
    #   year prop.positive
    #1 1970          0.75
    #2 1980          1.00
    #3 1990          1.00