
Nested tables and calculating summary statistics with confidence intervals in R


This question is about the statistical program R.

Data

I have a data frame, study_data, that has 100 rows, each representing a different person, and three columns, gender, height_category, and freckles. The variable gender is a factor and takes the value of either "male" or "female". The variable height_category is also a factor and takes the value of "tall" or "short". The variable freckles is a continuous, numeric variable that states how many freckles that individual has.

Here are some example data (thanks to Roland for this):

set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,TRUE),
                 height_category=sample(c("tall","short"),100,TRUE),
                 freckles=runif(100,0,100))

Question 1

I would like to create a nested table that divides these patients into "male" versus "female", further subdivides them into "tall" versus "short", and then calculates the number of patients in each sub-grouping along with the median number of freckles with the lower and upper 95% confidence interval.

Example

The table should look something like what is shown below, where the # signs are replaced with the appropriate calculated results.

gender height_category n median_freckles LCI UCI

male              tall #               #   #   #
                 short #               #   #   #
female            tall #               #   #   #
                 short #               #   #   #

Question 2

Once these results have been calculated, I would then like to create a bar graph. The y axis will be the median number of freckles. The x axis will be divided into male versus female. However, these sections will be subdivided by height category (so there will be a total of four bars in groups of two). I'd like to overlay the 95% confidence bands on top of the bars.

What I've tried

I know that I can make a nested table of counts using the xtabs and ftable commands (both are in the stats package that ships with R, so no extra library is needed):

ftable(xtabs(formula = ~ gender + height_category, data = study_data))

However, I'm not sure how to incorporate calculating the median of the number of freckles into this command and then getting it to show up in the summary table. I'm also aware that ggplot2 can be used to make bar graphs, but am not sure how to do this given that I can't calculate the data that I need in the first place.


Solution

    set.seed(42)
    DF <- data.frame(gender=sample(c("m","f"),100,TRUE),
                     height_category=sample(c("tall","short"),100,TRUE),
                     freckles=runif(100,0,100))
    
    
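    library(plyr)
    # one row per gender x height_category group
    res <- ddply(DF,.(gender,height_category),summarise,
                 n=length(na.omit(freckles)),  # non-missing observations
                 median_freckles=quantile(freckles,0.5,na.rm=TRUE),
                 # 2.5% and 97.5% quantiles of the raw values in the group
                 LCI=quantile(freckles,0.025,na.rm=TRUE),
                 UCI=quantile(freckles,0.975,na.rm=TRUE))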
    
    library(ggplot2)
    p1 <- ggplot(res,aes(x=gender,y=median_freckles,ymin=LCI,ymax=UCI,
                         group=height_category,fill=height_category)) +
      geom_bar(stat="identity",position="dodge") +
      geom_errorbar(position="dodge")
    print(p1)
    


    # a better plot that doesn't require precalculating the stats
    library(Hmisc)  # package name is case-sensitive; median_hilow needs Hmisc
    p2 <- ggplot(DF,aes(x=gender,y=freckles,colour=height_category)) +
      stat_summary(fun.data="median_hilow",geom="pointrange",position = position_dodge(width = 0.4))
    print(p2)
    

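One caveat on the summary table above: `quantile(freckles, 0.025)` and `quantile(freckles, 0.975)` describe the spread of the raw values in each group, which is not the same thing as a 95% confidence interval for the median. If an interval for the median itself is wanted, a percentile bootstrap is one common option. The following is only a sketch under that assumption, using base R. `boot_median_ci`, `n_boot`, `agg`, and `res_boot` are names made up for illustration, and 2000 resamples is an arbitrary choice.

```r
set.seed(42)
DF <- data.frame(gender = sample(c("m", "f"), 100, TRUE),
                 height_category = sample(c("tall", "short"), 100, TRUE),
                 freckles = runif(100, 0, 100))

# Hypothetical helper (not from any package): percentile-bootstrap CI
# for the median of a numeric vector.
boot_median_ci <- function(x, n_boot = 2000, level = 0.95) {
  x <- x[!is.na(x)]
  # median of each resample, drawn with replacement from x
  meds <- replicate(n_boot, median(x[sample.int(length(x), replace = TRUE)]))
  alpha <- (1 - level) / 2
  ci <- quantile(meds, c(alpha, 1 - alpha), names = FALSE)
  c(n = length(x), median_freckles = median(x), LCI = ci[1], UCI = ci[2])
}

# aggregate() stores the four returned values as a matrix column,
# so unpack that matrix into ordinary columns.
agg <- aggregate(freckles ~ gender + height_category, data = DF,
                 FUN = boot_median_ci)
res_boot <- cbind(agg[c("gender", "height_category")],
                  as.data.frame(agg$freckles))
res_boot <- res_boot[order(res_boot$gender, res_boot$height_category), ]
print(res_boot)
```

`res_boot` has the same column names as `res`, so it should be usable in the first ggplot call in place of `res`.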