Producing same graph from raw and summarized data


Same plot from raw and summarized data

For the following data structure

dsN<-data.frame(
  id=rep(1:100, each=4),
  yearF=factor(rep(2001:2004, 100)),
  attendF=sample(1:8, 400, T, c(.2,.2,.15,.10,.10, .20, .15, .02))
)
dsN[sample(which(dsN$yearF==2001), 5), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2002), 10), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2003), 15), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2004), 20), "attendF"]<-NA

attcol8<-c("Never"="#4575b4",
           "Once or Twice"="#74add1",
           "Less than once/month"="#abd9e9",
           "About once/month"="#e0f3f8",
           "About twice/month"="#fee090",
           "About once/week"="#fdae61",
           "Several times/week"="#f46d43",
           "Everyday"="#d73027")
dsN$attendF<-factor(dsN$attendF, levels=1:8, labels=names(attcol8))
head(dsN,13)

   id yearF              attendF
1   1  2001      About once/week
2   1  2002     About once/month
3   1  2003      About once/week
4   1  2004                 <NA>
5   2  2001 Less than once/month
6   2  2002      About once/week
7   2  2003      About once/week
8   2  2004   Several times/week
9   3  2001        Once or Twice
10  3  2002      About once/week
11  3  2003                 <NA>
12  3  2004        Once or Twice
13  4  2001   Several times/week

we can obtain a series of stacked bar charts

require(ggplot2)
# p<- ggplot( subset(dsN,!is.na(attendF)), aes(x=yearF, fill=attendF)) # leaving NA out of calculations
p<- ggplot( dsN, aes(x=yearF, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="fill")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

[Figure: stacked bar chart of response categories by wave, shown as proportions of the total]

The graph above is produced from the raw data. However, it is sometimes convenient to produce graphs from summarized data, especially if one needs control over statistical functions. Below is a transformation of dsN into ds, which contains only the values that are actually mapped on the graph above:

require(dplyr)
ds<- dsN %>%
  dplyr::filter(!is.na(attendF)) %>%
  dplyr::group_by(yearF,attendF) %>%
  dplyr::summarize(count = sum(attendF)) %>%
  dplyr::mutate(total = sum(count),
              percent= count/total)
head(ds,10)

    Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    18   373 0.04826
    2   2001        Once or Twice    36   373 0.09651
    3   2001 Less than once/month    30   373 0.08043
    4   2001     About once/month    32   373 0.08579
    5   2001    About twice/month    40   373 0.10724
    6   2001      About once/week    90   373 0.24129
    7   2001   Several times/week   119   373 0.31903
    8   2001             Everyday     8   373 0.02145
    9   2002                Never    11   355 0.03099
    10  2002        Once or Twice    44   355 0.12394

# verify
summarize(filter(ds, yearF==2001), should.be.one=sum(percent))

    Source: local data frame [1 x 2]

      yearF should.be.one
    1  2001             1
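The check above covers only 2001. A minimal base-R sketch of the same check for every wave at once could look like the following. Here `toy` is a hypothetical stand-in for ds with made-up counts; only the column names are taken from the output above.

```r
# Hypothetical toy summary with the same columns as ds (counts are made up).
toy <- data.frame(
  yearF   = factor(rep(c(2001, 2002), each = 3)),
  attendF = factor(rep(c("Never", "About once/week", "Everyday"), 2)),
  count   = c(20, 50, 30, 10, 60, 30)
)

# Per-year total and proportion, computed with base R only.
toy$total   <- ave(toy$count, toy$yearF, FUN = sum)
toy$percent <- toy$count / toy$total

# Sum of the proportions within each year; every entry should be 1.
check <- tapply(toy$percent, toy$yearF, sum)
print(check)
```

If any entry of `check` differs from 1, the totals were not computed within year.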

Question:

How would one re-create the graph above using the summary dataset ds?


Solution

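  • Well, part of the problem is that your summarizing is incorrect. First, sum(attendF) appears to add up the factor's underlying integer codes instead of counting rows, which is why the yearly total comes out as 373 even though each year has only 100 rows. Second, you need to leave the NA values in there if you want to properly account for them in the total, as your plot from the raw data does. Perhaps try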

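    # count rows per year and response; NA stays in as its own group
    ds<- dsN %>%
      dplyr::group_by(yearF,attendF) %>%
      dplyr::summarize(count = length(attendF)) %>%
      # after summarize, ds is still grouped by yearF, so total is per year
      dplyr::mutate(total = sum(count, na.rm=TRUE),
                  percent= count/total)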
    

    Then, to use the summarized data, you need to change only your first two lines slightly

    p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  keeping NA in calculations
    p<- p+ geom_bar(position="stack", stat="identity")
    

    Note that we add an explicit y value and tell geom_bar to use stat="identity", which means it uses the y values we supplied as the bar heights. The two approaches will then produce the same image

    [Figure: the same stacked bar chart, drawn from the summarized data ds]
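To make the comparison concrete, here is a small self-contained sketch of both routes side by side. It is not the original data: `raw` is a hypothetical stand-in for dsN with only three response levels, the seed is arbitrary, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Hypothetical stand-in for dsN: 4 waves, 3 response levels, some NA.
raw <- data.frame(
  yearF   = factor(rep(2001:2004, times = 50)),
  attendF = factor(sample(c("Never", "About once/week", "Everyday", NA),
                          200, replace = TRUE),
                   levels = c("Never", "About once/week", "Everyday"))
)

# Count rows per year and response (NA kept as its own group),
# then divide each count by the yearly total.
smry <- raw %>%
  group_by(yearF, attendF) %>%
  summarize(count = n(), .groups = "drop_last") %>%
  mutate(total = sum(count), percent = count / total) %>%
  ungroup()

# Route 1: raw data. geom_bar counts rows; position = "fill" rescales to 1.
p_raw <- ggplot(raw, aes(x = yearF, fill = attendF)) +
  geom_bar(position = "fill")

# Route 2: summarized data. Bar heights are taken from `percent` as-is.
p_sum <- ggplot(smry, aes(x = yearF, y = percent, fill = attendF)) +
  geom_bar(stat = "identity", position = "stack")
```

Printing `p_raw` and `p_sum` should show the same bars, apart from the y-axis label. In ggplot2 2.2.0 and later, `geom_col()` is shorthand for `geom_bar(stat = "identity")`.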