Search code examples
rdplyraggregateinner-join

Aggregating a dataframe with dplyr in R based on several dummy variables


I am using dplyr to aggregate my dataframe, so it shows percentages of people choosing specific protein design tasks by company size. I have different dummy variables for protein design tasks, because this was a multiple choice question in a survey. I figured out a way to do this, but my code is very long, because I aggregate the data per task and then join all these separate dataframes together into one. I’m curious whether there is a more elegant (shorter) way to do this?

library(tidyverse)

EarlyAccess <- read_csv("https://dropbox.com/s/antzwk1jh4ldrhi/EarlyAccess_anon.csv?dl=1")

#################### STABILITY ################################################
Proportions_tasks_stability <- EarlyAccess %>% 
  select(size, Improving.stability..generic..thermal..pH.) %>% 
  group_by(size, Improving.stability..generic..thermal..pH.) %>% 
  summarise(count_var_stability=n())%>%
  mutate(total_group_by_size = sum(count_var_stability)) %>%
  mutate(pc_var_stability=count_var_stability/sum(count_var_stability)*100) %>%
  filter(Improving.stability..generic..thermal..pH.=="Improving stability (generic, thermal, pH)") %>%
  select(size, Improving.stability..generic..thermal..pH., pc_var_stability)

######################## ACTIVITY #############################################
Proportions_tasks_activity <- EarlyAccess %>% 
  select(size, Improving.activity ) %>% 
  group_by(size, Improving.activity) %>% 
  summarise(count_var_activity=n())%>%
  mutate(total_group_by_size = sum(count_var_activity)) %>%
  mutate(pc_var_activity=count_var_activity/sum(count_var_activity)*100) %>%
  filter(Improving.activity=="Improving activity") %>%
  select(size, Improving.activity,  pc_var_activity)

######################## BINDING AFFINITY ######################################
Proportions_tasks_binding.affinity<- EarlyAccess %>% 
  select(size, Improving.binding.affinity ) %>% 
  group_by(size, Improving.binding.affinity) %>% 
  summarise(count_var_binding.affinity=n())%>%
  mutate(total_group_by_size = sum(count_var_binding.affinity)) %>%
  mutate(pc_var_binding.affinity=count_var_binding.affinity/sum(count_var_binding.affinity)*100) %>%
  filter(Improving.binding.affinity=="Improving binding affinity") %>%
  select(size, Improving.binding.affinity,  pc_var_binding.affinity)



# Then join them
Protein_design_tasks <- Proportions_tasks_stability %>%
  inner_join(Proportions_tasks_activity, by = "size") %>%
  inner_join(Proportions_tasks_binding.affinity, by = "size")

Solution

  • Using the datafile you provided, this should give the percentages of the selected category within each column for each size:

    library(tidyverse)
    
    df <-
      read_csv("https://dropbox.com/s/antzwk1jh4ldrhi/EarlyAccess_anon.csv?dl=1")
    
    df |>
      group_by(size) |>
      summarise(
        pc_var_stability = sum(
          Improving.stability..generic..thermal..pH. == "Improving stability (generic, thermal, pH)",
          na.rm = TRUE
        ) / n() * 100,
        pc_var_activity = sum(Improving.activity == "Improving activity",
                              na.rm = TRUE) / n() * 100,
        pc_var_binding.affinity = sum(
          Improving.binding.affinity == "Improving binding affinity",
          na.rm = TRUE
        ) / n() * 100
      )
    #> # A tibble: 7 × 4
    #>   size       pc_var_stability pc_var_activity pc_var_binding.affinity
    #>   <chr>                 <dbl>           <dbl>                   <dbl>
    #> 1 1000-10000             43.5            47.8                    34.8
    #> 2 10000+                 65              65                      70  
    #> 3 11-50                  53.8            53.8                    46.2
    #> 4 2-10                   51.1            46.8                    46.8
    #> 5 200-1000               64.7            52.9                    52.9
    #> 6 50-200                 42.1            42.1                    36.8
    #> 7 Just me                48.5            39.4                    54.5
    

    Looking at your data, each column has either the string value you're testing for or NA, so you could make it even shorter/tidier just by counting non-NAs in relevant columns:

    df |>
      group_by(size) |>
      summarise(across(
        c(
          Improving.stability..generic..thermal..pH.,
          Improving.activity,
          Improving.binding.affinity
        ),
        \(val) 100 * sum(!is.na(val)) / n()
      ))
    

    If what you're aiming to do is summarise across all columns then the latter method may work best - there are several ways of specifying which columns you want and so you don't necessarily need to type all names and values in. You might also find it clearest to make calculating and formatting all percentages a named function to call:

    library(tidyverse)
    
    df <-
      read_csv("https://dropbox.com/s/antzwk1jh4ldrhi/EarlyAccess_anon.csv?dl=1", 
               show_col_types = FALSE)
    
    
    perc_nonmissing <- function(val) {
      sprintf("%.1f%%", 100 * sum(!is.na(val)) / n())
    }
    
    
    df |>
      group_by(size) |>
      summarise(across(-c(1:2), perc_nonmissing))
    
    #> # A tibble: 7 × 12
    #>   size   Disco…¹ Searc…² Under…³ Impro…⁴ Impro…⁵ Impro…⁶ Impro…⁷ Impro…⁸ Impro…⁹
    #>   <chr>  <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
    #> 1 1000-… 21.7%   17.4%   43.5%   47.8%   39.1%   43.5%   30.4%   39.1%   39.1%  
    #> 2 10000+ 40.0%   55.0%   55.0%   65.0%   70.0%   65.0%   20.0%   30.0%   40.0%  
    #> 3 11-50  30.8%   26.9%   42.3%   53.8%   38.5%   53.8%   15.4%   30.8%   38.5%  
    #> 4 2-10   38.3%   40.4%   48.9%   46.8%   36.2%   51.1%   23.4%   31.9%   42.6%  
    # etc.