r frequency-distribution

How to get a frequency table in R, including counts, relative frequencies, and cumulative frequencies?


I have used RStudio for years, more often than any other software, but now that I'm going to teach statistics with R, I realize that some tasks are simply easier in other software such as STATA.

Is there a simple way of getting a frequency table in R (including count, percent, and cumulative frequencies) just like we would get by typing tab [variable] in STATA?

I came across this tidyverse solution:

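library(dplyr)   # provides %>%, group_by(), summarise(), mutate()
library(tibble)  # provides tribble()

dataset <- tribble(
           ~var1, ~var2, ~var3, ~var4, ~var5,
           "1",   "1",   "1",   "a",   "d",
           "2",   "2",   "2",   "b",   "e",
           "3",   "3",   "3",   "c",   "f")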

dataset %>%
      group_by(var1) %>%
      summarise(n = n()) %>%
      mutate(totalN = cumsum(n),
             percent = round(n / sum(n), 3),
             cumpercent = round(cumsum(n / sum(n)), 3))

But this is, very obviously, far too complicated to teach undergrads. Isn't there an easier way, maybe even a base R solution? Ideally, I would like one line of code that doesn't require installing 5-10 different packages first.


Solution

  • I don't agree with your claim that undergrads can't understand this. I also don't want to turn this question into a debate about teaching strategies, or about whether you should be using R if you don't believe it suits the level of your course.

    You can supply them with this function, which they don't have to understand (the same way they don't have to understand the one from STATA).

    library(dplyr)
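    tab <- function(dataset, var) {
      dataset %>%
        # embrace var with {{ }} so the function can be called with any grouping column
        group_by({{ var }}) %>%
        summarise(n = n()) %>%                   # count per group
        mutate(totalN = cumsum(n),               # cumulative count
               percent = n / sum(n),             # relative frequency (a proportion, not multiplied by 100)
               cumpercent = cumsum(n / sum(n)))  # cumulative relative frequency
    }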
    
    

    Then (provided you have saved the function in a file such as tab.R and run source("tab.R")), here's your one-liner:

    tab(dataset, var1)
    # A tibble: 3 x 5
      var1      n totalN percent cumpercent
      <chr> <int>  <int>   <dbl>      <dbl>
    1 1         1      1   0.333      0.333
    2 2         1      2   0.333      0.667
    3 3         1      3   0.333      1  
    

    You can try tab(dataset, var2). Please note that this answer will only group by one factor (this was your question).

    EDIT

    "one needs to understand how to set the working directory (etc.)"

    Not entirely true: if you are using RStudio, you can import a dataset from a folder manually with a few clicks. If you want to teach stats using R (which I think you definitely should), you should spend at least one class on the bare essentials (yes, that includes the working directory, how to call library(...), and basic functions). There is a huge number of resources (books, YouTube tutorials) you can assign as homework or as part of the class, so students become familiar with them. The argument that WHATEVER SOFTWARE IS EASIER is weak once we drop all assumptions: I would still need to know where to click in the specific version of whatever software...
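
On the question's request for a base R route with no extra packages, here is a minimal sketch built from table(), prop.table() and cumsum(). The helper name tab_base and the output column value are assumptions made for illustration; they are not part of the answer above. The other column names mirror the dplyr version.

```r
# Hypothetical helper (not from the original answer): a base R frequency table.
tab_base <- function(x) {
  counts <- table(x)            # count per distinct value (NAs are dropped by default)
  props  <- prop.table(counts)  # relative frequencies
  data.frame(
    value      = names(counts),
    n          = as.vector(counts),
    totalN     = cumsum(as.vector(counts)),  # cumulative count
    percent    = as.vector(props),           # proportion, not multiplied by 100
    cumpercent = cumsum(as.vector(props)),   # cumulative proportion
    stringsAsFactors = FALSE
  )
}

# Example on a built-in dataset, so nothing has to be installed or imported
tab_base(mtcars$cyl)

# The same idea as a single expression, if a plain matrix result is acceptable
cbind(n          = table(mtcars$cyl),
      percent    = prop.table(table(mtcars$cyl)),
      cumpercent = cumsum(prop.table(table(mtcars$cyl))))
```

Unlike tab(), this sketch takes a vector (for example dataset$var1) rather than a data frame plus a column name, which avoids the {{ }} machinery at the cost of a slightly different call.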