I'm analyzing some survey data and using expss to create tables.
One of our questions is about brand awareness. I have 3 types of brands: BrandA is a brand that a large subset of the sample sees, BrandB is a brand that a smaller (mutually exclusive!) subset of the sample sees, and BrandC is a brand that every respondent sees.
I'd like to treat this awareness question as a multiple response question and report the % of people (who actually saw the brand) who are aware of each brand. (In this case, a value of 1 means that the respondent was aware of the brand.)
The closest I can get is by using the code below, but tab_stat_cpct() is not reporting accurate percentages or # of cases, as you can see in the attached table. When you compare the Total % listed in the table to the total % computed manually (i.e., via mean(data$BrandA, na.rm = TRUE)), it is reporting values that are too low for BrandA and BrandB, and a value that is too high for BrandC. (Not to mention that the total # of cases should be 25.)
I've read over the documentation, and I understand that this issue is due to how tab_stat_cpct() defines a "case" for the purposes of computing the percentage, but I don't see an argument that will adjust that definition to do what I need. Am I missing something? Or is there some other way of reporting accurate percentages? Thanks!
library(expss)

set.seed(123)
data <- data.frame(
    Age = sample(c("25-34", "35-54", "55+"), 25, replace = TRUE),
    BrandA = c(1, 0, 0, 1, 0, 1, NA, NA, NA, NA, NA, NA, NA, 1,
               0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1),
    BrandB = c(NA, NA, NA, NA, NA, NA, 1, 1, 0, 1, 0, 1, 1, NA,
               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    BrandC = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
               1, 1, 1, 0, 1, 0, 1, 0, 1)
)

data %>%
    tab_cells(mrset(as.category(BrandA %to% BrandC))) %>%
    tab_cols(total(), Age) %>%
    tab_stat_cpct() %>%
    tab_last_sig_cpct() %>%
    tab_pivot()
## | | #Total | Age | | |
## | | | 25-34 | 35-54 | 55+ |
## | | | A | B | C |
## | ------------ | ------ | ------- | ----- | ---- |
## | BrandA | 52.4 | 83.3 B | 28.6 | 50.0 |
## | BrandB | 23.8 | | 42.9 | 25.0 |
## | BrandC | 71.4 | 100.0 C | 71.4 | 50.0 |
## | #Total cases | 21 | 6 | 7 | 8 |
All items in a multiple-response set are assumed to share the same base. The base for an mdset is the number of cases with at least one non-empty item (an item with value 1), which is why the base for your brands is 21 rather than 25. If each item were treated separately, we would need to show a total for each item in order to calculate significance, which is often inconvenient.
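To make the difference between the two bases concrete, here is a minimal base-R sketch that reproduces the #Total column by hand. It uses only the brand columns from the question; the names `brands`, `shared_base`, `item_base` and `aware` are made up for this illustration and are not part of expss.

```r
# Same brand columns as in the question (Age is not needed for the base counts).
brands <- data.frame(
    BrandA = c(1, 0, 0, 1, 0, 1, NA, NA, NA, NA, NA, NA, NA, 1,
               0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1),
    BrandB = c(NA, NA, NA, NA, NA, NA, 1, 1, 0, 1, 0, 1, 1, NA,
               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    BrandC = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
               1, 1, 1, 0, 1, 0, 1, 0, 1)
)

# Shared multiple-response base: respondents with at least one item equal to 1.
shared_base <- sum(rowSums(brands == 1, na.rm = TRUE) > 0)   # 21

# Per-item base: respondents who actually saw each brand (non-NA values).
item_base <- colSums(!is.na(brands))                         # 18, 7, 25

# Number of aware respondents per brand.
aware <- colSums(brands == 1, na.rm = TRUE)                  # 11, 5, 15

round(100 * aware / shared_base, 1)  # 52.4 23.8 71.4 (what tab_stat_cpct reports)
round(100 * aware / item_base, 1)    # 61.1 71.4 60.0 (what the question asks for)
```

The four respondents who answered 0 to every brand they saw drop out of the shared base, which is where the 21 comes from.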
In your situation you can use the following function:
library(expss)
tab_stat_dich = function(data, total_label = NULL, total_statistic = "u_cases",
                         label = NULL){
    if (missing(total_label) && !is.null(data[["total_label"]])) {
        total_label = data[["total_label"]]
    }
    if(is.null(total_label)){
        total_label = "#Total"
    }
    # calculate means
    res = eval.parent(
        substitute(
            tab_stat_mean_sd_n(data, weighted_valid_n = "w_cases" %in% total_statistic,
                               labels = c("|", "@@@@@", total_label),
                               label = label)
        )
    )
    curr_tab = res[["result"]][[length(res[["result"]])]]
    # drop standard deviation
    curr_tab = curr_tab[c(TRUE, FALSE, TRUE), ]
    # convert means to percent
    curr_tab[c(TRUE, FALSE), -1] = curr_tab[c(TRUE, FALSE), -1] * 100
    ## clear row labels
    curr_tab[[1]] = gsub("^(.+?)\\|(.+)$", "\\2", curr_tab[[1]], perl = TRUE )
    res[["result"]][[length(res[["result"]])]] = curr_tab
    res
}

set.seed(123)
data <- data.frame(
    Age = sample(c("25-34", "35-54", "55+"), 25, replace = TRUE),
    BrandA = c(1, 0, 0, 1, 0, 1, NA, NA, NA, NA, NA, NA, NA, 1,
               0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1),
    BrandB = c(NA, NA, NA, NA, NA, NA, 1, 1, 0, 1, 0, 1, 1, NA,
               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    BrandC = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
               1, 1, 1, 0, 1, 0, 1, 0, 1)
)

res = data %>%
    tab_cells(BrandA %to% BrandC) %>%
    tab_cols(total(), Age) %>%
    tab_stat_dich() %>%
    tab_last_sig_cpct() %>%
    tab_pivot()
res
# | | #Total | Age | | |
# | | | 25-34 | 35-54 | 55+ |
# | | | A | B | C |
# | ------ | ------ | ----- | ------ | ---- |
# | BrandA | 61.1 | 71.4 | 83.3 C | 20.0 |
# | #Total | 18 | 7 | 6 | 5 |
# | BrandB | 71.4 | 100.0 | 66.7 | 50.0 |
# | #Total | 7 | 2 | 3 | 2 |
# | BrandC | 60.0 | 55.6 | 66.7 | 57.1 |
# | #Total | 25 | 9 | 9 | 7 |
# if we want to drop totals
where(res, !grepl("#", row_labels))
# | | #Total | Age | | |
# | | | 25-34 | 35-54 | 55+ |
# | | | A | B | C |
# | ------ | ------ | ----- | ------ | ---- |
# | BrandA | 61.1 | 71.4 | 83.3 C | 20.0 |
# | BrandB | 71.4 | 100.0 | 66.7 | 50.0 |
# | BrandC | 60.0 | 55.6 | 66.7 | 57.1 |
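If you want to sanity-check the per-Age cells without expss, a rough base-R sketch along these lines should work. It assumes the same `data` as above; `brand_cols`, `pct_by_age` and `n_by_age` are hypothetical names used only here. No output is shown because the Age column comes from sample(), whose result for a given seed can differ between R versions.

```r
set.seed(123)
data <- data.frame(
    Age = sample(c("25-34", "35-54", "55+"), 25, replace = TRUE),
    BrandA = c(1, 0, 0, 1, 0, 1, NA, NA, NA, NA, NA, NA, NA, 1,
               0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1),
    BrandB = c(NA, NA, NA, NA, NA, NA, 1, 1, 0, 1, 0, 1, 1, NA,
               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    BrandC = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
               1, 1, 1, 0, 1, 0, 1, 0, 1)
)

brand_cols <- c("BrandA", "BrandB", "BrandC")

# Percent aware within each Age group, ignoring respondents who did not see the brand.
pct_by_age <- aggregate(
    data[brand_cols],
    by = list(Age = data$Age),
    FUN = function(x) round(100 * mean(x, na.rm = TRUE), 1)
)

# Per-brand base within each Age group (number of non-NA answers).
n_by_age <- aggregate(
    data[brand_cols],
    by = list(Age = data$Age),
    FUN = function(x) sum(!is.na(x))
)

pct_by_age
n_by_age
```

A group in which nobody saw a brand will show NaN in `pct_by_age`, which corresponds to an empty cell in the expss table. This sketch does not do the significance testing that tab_last_sig_cpct() adds.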