Search code examples
rpivotdata-manipulation

R: Pivoting Grouped Frequencies In Terms of their Counts


I am working with the R programming language.

I have the following dataset:

set.seed(123)
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)


status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status  <- as.factor(status )


country <- c("A", "B", "C", "D")
country <- sample(country, 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25))
country  <- as.factor(country)



################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
disease <- as.factor(disease)

###################
my_data = data.frame(gender, status, disease, country)

I want to find out the relative percentage of each unique group of factors that have the disease vs do not have the disease.

As an example:

  • What percentage of Male Immigrants from Country A have the disease vs don't have the disease
  • What percentage of Male Citizens from Country A have the disease vs don't have the disease (both these percentages should add to 1)
  • etc.

I tried to do this with the following code:

# https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr
library(dplyr)

step_1 = my_data %>% group_by (gender, status, country, disease) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))


`summarise()` has grouped output by 'gender', 'status', 'country'. You can override using the
`.groups` argument.
# A tibble: 32 x 6
# Groups:   gender, status, country [16]
   gender status    country disease     n rel.freq
   <fct>  <fct>     <fct>   <fct>   <int> <chr>   
 1 Female Citizen   A       No        285 60%     
 2 Female Citizen   A       Yes       193 40%  

Now (assuming this is correct), I am trying to make some modifications to this result - this should reduce the number of rows by half (i.e. two rows in step_1 for yes and no are now combined into a single row):

step_2 = step_1 %>% 
  group_by(gender, status, country) %>% 
  summarize(disease = first(disease),
# number of people in this row that do not have the disease
            n_no = sum(disease == "No"),
# number of people in this row that do have the disease
            n_yes = sum(disease == "Yes"),
# relative percent of people in this row that do not  have the disease
            n_no_rel_freq = paste(round(sum(disease == "No") / sum(n) * 100), "%"),
# relative percent of people in this row that do have the disease
            n_yes_rel_freq = paste(round(sum(disease == "Yes") / sum(n) * 100), "%"),
# overall percent of all people in this row relative to entire population
            overall_percent = sum(n) / sum(step_1$n))

The code seems to run - but many of the percentages are now 0:

# A tibble: 16 x 9
# Groups:   gender, status [4]
   gender status    country disease  n_no n_yes n_no_rel_freq n_yes_rel_freq overall_percent
   <fct>  <fct>     <fct>   <fct>   <int> <int> <chr>         <chr>                    <dbl>
 1 Female Citizen   A       No          1     0 0 %           0 %                     0.102 
 2 Female Citizen   B       No          1     0 0 %           0 %                     0.092 

Can someone please show me how to fix this?

Thanks!

Note: The final result should look something like this

    # desired result (sample)
  gender  status country n_no n_yes n_no_rel_freq n_yes_rel_freq total overall_percent
1 female citizen       A  285   193           0.6            0.4   478          0.0956

Solution

  • Maybe you can make use of pivot-wider.

    library(tidyverse)
    
    
    my_data %>% group_by (gender, status, country, disease) %>%
      summarise (n=n()) %>%
      mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%")) -> step_1
    #> `summarise()` has grouped output by 'gender', 'status', 'country'. You can
    #> override using the `.groups` argument.
    
    step_1 |>group_by(country) |>  
      pivot_wider(names_from = disease, 
                  values_from = c(n:rel.freq), 
                  names_prefix = "disease_") |> 
      mutate(overallPerc = (n_disease_No + n_disease_Yes)/sum(step_1$n))
    #> # A tibble: 16 × 8
    #> # Groups:   country [4]
    #>    gender status    country n_disease_No n_disease_Yes rel.fre…¹ rel.f…² overa…³
    #>    <fct>  <fct>     <fct>          <int>         <int> <chr>     <chr>     <dbl>
    #>  1 Female Citizen   A                308           200 61%       39%      0.102 
    #>  2 Female Citizen   B                291           169 63%       37%      0.092 
    #>  3 Female Citizen   C                301           228 57%       43%      0.106 
    #>  4 Female Citizen   D                245           189 56%       44%      0.0868
    #>  5 Female Immigrant A                107            95 53%       47%      0.0404
    #>  6 Female Immigrant B                126            76 62%       38%      0.0404
    #>  7 Female Immigrant C                137            70 66%       34%      0.0414
    #>  8 Female Immigrant D                129            74 64%       36%      0.0406
    #>  9 Male   Citizen   A                237           167 59%       41%      0.0808
    #> 10 Male   Citizen   B                247           163 60%       40%      0.082 
    #> 11 Male   Citizen   C                250           171 59%       41%      0.0842
    #> 12 Male   Citizen   D                230           139 62%       38%      0.0738
    #> 13 Male   Immigrant A                103            68 60%       40%      0.0342
    #> 14 Male   Immigrant B                117            63 65%       35%      0.036 
    #> 15 Male   Immigrant C                 93            53 64%       36%      0.0292
    #> 16 Male   Immigrant D                102            52 66%       34%      0.0308
    #> # … with abbreviated variable names ¹​rel.freq_disease_No,
    #> #   ²​rel.freq_disease_Yes, ³​overallPerc