Search code examples
rsubtotalexpss

Influence of subtotals on significance tests in expss tables


Hello to R/expss experts! This is a follow-up question to this one --> Complex tables with expss package.

I added subtotals to already complex tables using the excellent expss package, and it works well for most tasks (counts, proportions, means...). Yet, I found out statistical tests evaluation differ between one table without subtotals and the exact same with subtotals. @Gregory Demin, your knowledge would be greatly appreciated :)

An example to illustrate my words, using the infert dataset available in the datasets package:

example <- infert %>%
  tab_significance_options(sig_level=0.2, keep="none", sig_labels=NULL, subtable_marks="greater", mode="append") %>%
  tab_cols(total(), education) %>%
  tab_cells(parity) %>%
  # block for cases
  tab_stat_cases(label="N", total_row_position="above", total_statistic="u_cases", total_label="TOTAL") %>% 
  tab_last_add_sig_labels() %>%
  # block for percent statistic - Subtable tests  
  tab_stat_cpct(label="%Col.", total_row_position="above", total_statistic="u_cpct", total_label="TOTAL") %>%
  tab_last_add_sig_labels() %>%
  tab_last_sig_cpct(label="T.1", compare_type="subtable") %>%
  # block for percent statistic - First column tests
  tab_stat_cpct(label="T.2", total_row_position="above", total_statistic="u_cpct", total_label="TOTAL") %>%
  tab_last_add_sig_labels() %>%
  tab_last_sig_cpct(compare_type="first_column", mode="replace") %>%
  tab_pivot(stat_position="inside_columns") %>%
  # converts NA to zero
  recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy)
example <- example[,-c(4,5)]
print(example)

Note: sig_level is very high (20%) to illustrate this specific issue, do not panic :) This is the starting point and I am fine with that. Then we only add the subtotals (line 5):

example2 <- infert %>%
  tab_significance_options(sig_level=0.2, keep="none", sig_labels=NULL, subtable_marks="greater", mode="append") %>%
  tab_cols(total(), education) %>%
  tab_cells(parity) %>%
  tab_subtotal_cells("#FIRST 3"=c(1,2,3),"#LAST 3"=c(4,5,6), position = "above") %>%
  # block for cases
  tab_stat_cases(label="N", total_row_position="above", total_statistic="u_cases", total_label="TOTAL") %>% 
  tab_last_add_sig_labels() %>%
  # block for percent statistic - Subtable tests  
  tab_stat_cpct(label="%Col.", total_row_position="above", total_statistic="u_cpct", total_label="TOTAL") %>%
  tab_last_add_sig_labels() %>%
  tab_last_sig_cpct(label="T.1", compare_type="subtable") %>%
  # block for percent statistic - First column tests
  tab_stat_cpct(label="T.2", total_row_position="above", total_statistic="u_cpct", total_label="TOTAL") %>%
  tab_last_add_sig_labels() %>%
  tab_last_sig_cpct(compare_type="first_column", mode="replace") %>%
  tab_pivot(stat_position="inside_columns") %>%
  # converts NA to zero
  recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy)
example2 <- example2[,-c(4,5)]
print(example2)

I do not know what is happening, but the results of significance tests are not the same this time. Besides, I feel no significance test is calculated on the two subtotal rows. Any insight?


Solution

  • For significance testing between percents we need cases in the total statistic. So we will make total statistic with two rows. After all manipulations rows with total cases will be deleted. significance_cpct use # sign to detect total rows. And # in subtotals leads to incorrect results.

    Taking into account all above:

    example2 <- infert %>%
        tab_significance_options(sig_level=0.2, keep="none", sig_labels=NULL, subtable_marks="greater", mode="append") %>%
        tab_cols(total(), education) %>%
        tab_cells(parity) %>%
        tab_subtotal_cells("FIRST 3"=c(1,2,3),"LAST 3"=c(4,5,6), position = "above") %>%
        # block for cases
        tab_stat_cases(label="N", total_row_position="above", total_statistic="u_cases", total_label="TOTAL") %>% 
        tab_last_add_sig_labels() %>%
        # block for percent statistic - Subtable tests  
        # note additional total statistic
        tab_stat_cpct(label="%Col.", total_row_position="above", total_statistic= c("u_cases", "u_cpct"), 
                      total_label=c("TO DELETE", "TOTAL")) %>%
        tab_last_add_sig_labels() %>%
        tab_last_sig_cpct(label="T.1", compare_type="subtable") %>%
        # block for percent statistic - First column tests
        tab_stat_cpct(label="T.2", total_row_position="above", total_statistic= c("u_cases", "u_cpct"), 
                      total_label=c("TO DELETE", "TOTAL")) %>%
        tab_last_add_sig_labels() %>%
        tab_last_sig_cpct(compare_type="first_column", mode="replace") %>%
        tab_pivot(stat_position="inside_columns") %>%
        # drop row with TO_DELETE
        where(!grepl("TO DELETE", row_labels)) %>% 
        # converts NA to zero
        recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy)
    example2 <- example2[,-c(4,5)]
    print(example2)
    

    UPDATE with net on columns:

    data(infert)
    example2 <- infert %>%
        apply_labels(
            education = "Education"
        ) %>% 
        tab_significance_options(sig_level=0.2, keep="none", sig_labels=NULL, subtable_marks="greater", mode="append") %>%
        tab_cols(total(), net(education, "LESS THAN 12 Y.O."=levels(education)[1:2])) %>%
        tab_cells(parity) %>%
        tab_subtotal_cells("FIRST 3"=c(1,2,3),"LAST 3"=c(4,5,6), position = "above") %>%
        # block for cases
        tab_stat_cases(label="N", total_row_position="above", total_statistic="u_cases", total_label="TOTAL") %>% 
        tab_last_add_sig_labels() %>%
        # block for percent statistic - Subtable tests  
        # note additional total statistic
        tab_stat_cpct(label="%Col.", total_row_position="above", total_statistic= c("u_cases", "u_cpct"), 
                      total_label=c("TO DELETE", "TOTAL")) %>%
        tab_last_add_sig_labels() %>%
        tab_last_sig_cpct(label="T.1", compare_type="subtable") %>%
        # block for percent statistic - First column tests
        tab_stat_cpct(label="T.2", total_row_position="above", total_statistic= c("u_cases", "u_cpct"), 
                      total_label=c("TO DELETE", "TOTAL")) %>%
        tab_last_add_sig_labels() %>%
        tab_last_sig_cpct(compare_type="first_column", mode="replace") %>%
        tab_pivot(stat_position="inside_columns") %>%
        # drop row with TO_DELETE
        where(!grepl("TO DELETE", row_labels)) %>% 
        # converts NA to zero
        recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy)
    example2 <- example2[,-c(4,5)]
    print(example2)