For the following data structure
dsN<-data.frame(
id=rep(1:100, each=4),
yearF=factor(rep(2001:2004, 100)),
attendF=sample(1:8, 400, T, c(.2,.2,.15,.10,.10, .20, .15, .02))
)
dsN[sample(which(dsN$yearF==2001), 5), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2002), 10), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2003), 15), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2004), 20), "attendF"]<-NA
attcol8<-c("Never"="#4575b4",
"Once or Twice"="#74add1",
"Less than once/month"="#abd9e9",
"About once/month"="#e0f3f8",
"About twice/month"="#fee090",
"About once/week"="#fdae61",
"Several times/week"="#f46d43",
"Everyday"="#d73027")
dsN$attendF<-factor(dsN$attendF, levels=1:8, labels=names(attcol8))
head(dsN,13)
id yearF attendF
1 1 2001 About once/week
2 1 2002 About once/month
3 1 2003 About once/week
4 1 2004 <NA>
5 2 2001 Less than once/month
6 2 2002 About once/week
7 2 2003 About once/week
8 2 2004 Several times/week
9 3 2001 Once or Twice
10 3 2002 About once/week
11 3 2003 <NA>
12 3 2004 Once or Twice
13 4 2001 Several times/week
we can obtain a series of stacked bar charts:
require(ggplot2)
# p<- ggplot( subset(dsN,!is.na(attendF)), aes(x=yearF, fill=attendF)) # leaving NA out of calculations
p<- ggplot( dsN, aes(x=yearF, fill=attendF)) # keeping NA in calculations
p<- p+ geom_bar(position="fill")
p<- p+ scale_fill_manual(values = attcol8,
name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
limits=c(0, 1),
breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p
The graph above is produced from the raw data. However, it is sometimes convenient to produce graphs from summarized data, especially if one needs control over the statistical functions. Below is a transformation of dsN into ds, which is meant to contain only the values that are actually mapped on the graph above:
require(dplyr)
ds<- dsN %.%
dplyr::filter(!is.na(attendF)) %.%
dplyr::group_by(yearF,attendF) %.%
dplyr::summarize(count = sum(attendF)) %.%
dplyr::mutate(total = sum(count),
percent= count/total)
head(ds,10)
Source: local data frame [10 x 5]
Groups: yearF
yearF attendF count total percent
1 2001 Never 18 373 0.04826
2 2001 Once or Twice 36 373 0.09651
3 2001 Less than once/month 30 373 0.08043
4 2001 About once/month 32 373 0.08579
5 2001 About twice/month 40 373 0.10724
6 2001 About once/week 90 373 0.24129
7 2001 Several times/week 119 373 0.31903
8 2001 Everyday 8 373 0.02145
9 2002 Never 11 355 0.03099
10 2002 Once or Twice 44 355 0.12394
# verify
summarize(filter(ds, yearF==2001), should.be.one=sum(percent))
Source: local data frame [1 x 2]
yearF should.be.one
1 2001 1
How would one re-create the graph above using this summary dataset ds?
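Well, part of the problem is that your summarizing is incorrect. count = sum(attendF) does not count rows: the 2001 total of 373 is larger than the 100 respondents in that wave. You also need to leave the NA values in there if you want to properly account for them in the total. Perhaps try: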
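ds <- dsN %>%
  dplyr::group_by(yearF, attendF) %>%        # NA stays in as its own group
  dplyr::summarize(count = dplyr::n()) %>%   # number of rows per year and response
  dplyr::mutate(total = sum(count, na.rm=TRUE),   # per-year total, NA rows included
                percent = count/total)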
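As a cross-check on that kind of summary, here is a minimal base-R sketch that counts rows per wave and response while keeping NA as its own category. The `toy` data frame and its values are made up for illustration; they are not the `dsN` data from the question.

```r
# Toy data (hypothetical): two waves, two response levels, some missing answers
toy <- data.frame(
  yearF   = factor(c(2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002)),
  attendF = factor(c("Never", "Never", "Everyday", NA,
                     "Never", "Everyday", NA, NA),
                   levels = c("Never", "Everyday"))
)

# Count rows per wave (rows) and response (columns); NA gets its own column
counts <- table(toy$yearF, toy$attendF, useNA = "ifany")

# Divide each wave's counts by that wave's total, NA rows included
props <- prop.table(counts, margin = 1)

# Each wave should add up to 1
rowSums(props)
```

If the proportions from the dplyr summary and from `prop.table()` agree for your real data, the counts are being computed per row rather than from the factor codes.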
Then, to use the summarized data, you only need to change your first two lines slightly:
p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF)) # keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")
Note that we add an explicit y aesthetic and tell geom_bar to use stat="identity", which means the y values we supplied are used as the bar heights. With those two changes, the rest of the code produces the same image.
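For reference, a self-contained sketch of the whole pipeline on pre-summarized data might look like the following. The `ds_toy` data frame, its values, and the two-colour palette are assumptions made up for illustration. `geom_col()` is used here as the shorthand that ggplot2 provides for `geom_bar(stat = "identity")`, and the y limits are left out for brevity.

```r
library(ggplot2)

# Hypothetical pre-summarized data: one row per wave and response category,
# with the NA category kept so each wave's proportions add up to 1
ds_toy <- data.frame(
  yearF   = factor(rep(c(2001, 2002), each = 3)),
  attendF = factor(rep(c("Never", "Everyday", NA), times = 2),
                   levels = c("Never", "Everyday")),
  percent = c(0.50, 0.25, 0.25, 0.25, 0.25, 0.50)
)

p2 <- ggplot(ds_toy, aes(x = yearF, y = percent, fill = attendF)) +
  geom_col(position = "stack") +   # heights are taken directly from `percent`
  scale_fill_manual(values = c("Never" = "#4575b4", "Everyday" = "#d73027"),
                    name = "Response category") +
  scale_y_continuous("Prevalence: proportion of total",
                     breaks = seq(0.1, 1, by = 0.1)) +
  scale_x_discrete("Waves of measurement")
p2
```

The NA category is drawn in the scale's default missing-value colour, so it shows up as a separate segment in each bar just as it does in the plot built from the raw data.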