Search code examples
rlayoutoutputsummarygroup

R: Summary statistics for groups / subsets within panel data - code and layout


I have the following data (posted in the comment):

Now I would like to have summary statistics. I only need the mean and the number of observations. The summary statistics should be grouped according to the rating of the product and there should be a difference in means test for the means of the ratings 1 and 5. In the end it should look like this:

Desired layout / output of summary statistics

I came across the describeBy function. However, the problem there is that I could not end up with my desired layout (see picture) and I could not include the difference in the means test for the means of the rating 1 and 5 and the means for the whole sample.

Further, I also tried to use the stargazer package. But there I had similar problems.

Could anyone help me here?


Solution

  • You can use this dplyr/tidy pipeline:

    library(tidyverse)
    
    dt %>%
      group_by(Rating) %>% 
      summarize(mean_Revenue = mean(Revenue),
                mean_Costs = mean(Costs),
                mean_Age = mean(Age),
                Observations=n()
      ) %>% 
      pivot_longer(cols = !Rating) %>% 
      pivot_wider(id_cols = "name",names_from = Rating,values_from = value,names_glue = "Rating{.name}") %>% 
      mutate(`Anova F-Test (p-value)` = c(sapply(dt %>% select(Revenue:Age), function(y) anova(lm(y~dt$Rating))$`Pr(>F)`[[1]]),NA)) %>% 
      left_join(
        dt %>%  
          pivot_longer(cols=Revenue:Age) %>% 
          group_by(name = paste0("mean_",name)) %>% 
          summarize(Total_means=mean(value))
      )
    
    

    Output:

      name         Rating1 Rating2 Rating3 Rating4 Rating5 `Anova F-Test (p-value)` Total_means
      <chr>          <dbl>   <dbl>   <dbl>   <dbl>   <dbl>                    <dbl>       <dbl>
    1 mean_Revenue     200   400       250     300     200                    0.742       289. 
    2 mean_Costs        45    26.7      40      30      20                    0.196        33.3
    3 mean_Age           2     3         4       4       2                    0.552         3  
    4 Observations       2     3         2       1       1                   NA            NA  
    

    Updated 4/22/22

    • Original answer did not limit the anova to Ratings 1 and 5
    # small function to get anova
    get_anova <-function(y,rating, ratings=c(1,5)) {
      y_ = y[rating %in% ratings]
      x_ = rating[rating %in% ratings]
      anova(lm(y_~x_))$`Pr(>F)`[[1]]
    }
    
    dt %>%
      group_by(Rating) %>% 
      summarize(mean_Revenue = mean(Revenue),
                mean_Costs = mean(Costs),
                mean_Age = mean(Age),
                Observations=n()
      ) %>% 
      pivot_longer(cols = !Rating) %>% 
      pivot_wider(id_cols = "name",names_from = Rating,values_from = value,names_glue = "Rating{.name}") %>% 
      mutate(anova = c(sapply(dt %>% select(Revenue:Age), function(y) get_anova(y,rating=dt$Rating)),NA)) %>% 
      left_join(
        dt %>%  
          pivot_longer(cols=Revenue:Age) %>% 
          group_by(name = paste0("mean_",name)) %>% 
          summarize(Total_means=mean(value))
      )