Search code examples
rone-hot-encodinggtsummary

One-hot-encoded table to tbl_summary


I have a one-hot-encoded tibble that contains the occurrence of certain arteries to in specific tumours. These can be divided in 2 groups: primary and secondary. If there would be only 1 occurrence for every vessel possible per tumour, I would have no problem. However, some vessels have the same name (for example 1 for left and 1 for right side) and can occur 2 times per tumour.

To simplify the problem:

test <- tibble(
  group = c("primary", "secondary", "secondary", "primary"),
  common_carotid = as.numeric(c(1, 2, 0, 0)),
  internal_carotid = as.numeric(c(0, 1, 0, 1))
)

test |>
    tbl_summary(
      by = group,
      missing = "no"
    ) %>%
    add_p() %>%
    modify_header(label ~ "**Vessel**")

Gives

enter image description here

I would like to have a tbl_summary where common_carotid would show as 1 row like this but then with the correct numbers: enter image description here

The desired output is:

| Vessel           | Primary, N=2 | Secondary N=3 | p-value |
| Common_carotid   | 1 (50%)      | 2 (67%)       | >0.9    |
| internal_carotid | 1 (50%)      | 1 (33%)       | >0.9    |

I was able to create this table changing the value of 2 in the common_carotid group to 1. It seems that gtsummary treats these variables als factors and show the different levels although I tried to correct this to numbers.

It seems like a simple problem but who has the answer?

This is a part of the original data:

dput(head(data, 20))
structure(list(ID = c(23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 
33, 34, 35, 54, 55, 56, 57, 58, 59, 60), genesis_group = c("Secondary", 
"Secondary", "Secondary", "Secondary", "Secondary", "Primary", 
"Secondary", "Secondary", "Secondary", "Secondary", "Primary", 
"Secondary", "Secondary", "Primary", "Secondary", "Secondary", 
"Primary", "Primary", "Primary", "Primary"), Arteries = structure(c(21L, 
96L, 39L, 79L, 96L, 32L, 59L, 79L, NA, 85L, 48L, 59L, 48L, 100L, 
100L, 100L, 100L, 100L, 101L, 101L), levels = c("APA", "ASCA", 
"ASCA RA", "BCA", "DCA", "DCA RA", "DTA STA OA", "LAPA LOA", 
"LAPA RAPA RA", "LBCA", "LCCA", "LCCA LICA", "LCCA LICA LVA", 
"LDCA LOA", "LDCA LVA LASCA", "LDTA", "LECA", "LECA LSTA LEA REA LMMA", 
"LFA", "LIAA LSMA", "LICA", "LICA LMA", "LICA RECA", "LMA", "LMA LOA LSTA", 
"LMA LOPA DNA", "LOA", "LOA LAPA LDCA", "LOA LPAA LSTA", "LOA LSTA LEA", 
"LOA ROA LSTA RSTA", "LOPA", "LOPA LSTA", "LRA", "LSA", "LSA CCA", 
"LSOA RSOA RSTA LSTA", "LSPA", "LSTA", "LSTA LDTA", "LSTA LMA LMMA", 
"LSTA LOA", "LSTA LSTRA", "LSTA RSTA LSOA LPAA", "LSTA RSTA ROPA LOPA", 
"LSTHA", "LTT", "LVA", "LVA ECA", "LVA LECA", "MA", "OA", "OA PAA", 
"OA TT", "OPA", "PA", "RA", "RBCA", "RCCA", "RCCA RVA", "REA", 
"RECA", "RECA PAA", "RECA RFA", "RECA RSA RCCT RTT", "RFA", "RFA RMA", 
"RICA", "RICA RECA", "RITT", "RLA", "RMA", "ROA", "ROA RPAA", 
"ROA RSTA", "ROA RVA", "ROPA", "ROPA RMA", "RSA", "RSA TT", "RSCA", 
"RSOA", "RSOA RSTA", "RSPA", "RSTA", "RSTA LOA ROA", "RSTA LSTA", 
"RSTA RFA", "RSTA RMA RPAA", "RSTA ROA", "RSTA RPAA", "RSTHA", 
"RSUBLA ANGA", "RTA", "RTT", "RVA", "RVA ROA", "RVA RTT", "SLA", 
"STA", "STA OA", "STA OA OPA STA", "STA OPA", "STA PAA", "STA SOA", 
"TT", "VA", "VA APA DCA"), class = "factor"), Veins = c("LIJV", 
"RIJV", NA, "RIJV", "RVV", "LIOV", "REJV", "RSV", NA, NA, "LVV", 
"LIJV", "LIJV", NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

Solution

  • Identify all possible arteries and vessels entered using unique:

    unique(sub('^[L|R]', '', data$Arteries))
    # [1] "ICA"    "VA"     "STA"    "SA"     "OPA"    "CCA"    
    # [2] NA       "STA OA"
    
    unique(sub('^[L|R]', '', data$Veins))
    #[1] "IJV" NA    "VV"  "IOV" "EJV" "SV"
    

    Then reframe the data to create the columns you want to summarise using grepl, ignoring the L or R side:

    data |> 
      reframe(
        #Arteries
        'Common carotid'=grepl("[L|R]?CCA", Arteries),
        'Internal carotid'=grepl("[L|R]?ICA", Arteries),
        'Superficial temporal'=grepl("[L|R]?STA", Arteries),
        'Venous'=grepl("[L|R]?VA", Arteries),
        'Sinoatrial'=grepl("[L|R]?SA", Arteries),
        'Ophthalmic'=grepl("[L|R]?OPA", Arteries),
        # Veins
        'Internal jugular'=grepl("[L|R]?IJV", Veins),
        'External jugular'=grepl("[L|R]?EJV", Veins),
        'Vascular'=grepl("[L|R]?VV", Veins),
        'Splenic'=grepl("[L|R]?SV", Veins),
        'Inferior orbital'=grepl("[L|R]?IOV", Veins),
        .by=genesis_group) |>
      tbl_summary(
        by = genesis_group,
        missing = "no"
      ) %>%
      add_p() %>%
      modify_header(label ~ "**Vessel**") |>
      modify_table_body(~ .x |>
                          bind_rows(tibble(
                            variable=c("Arteries","Veins"),
                            var_type=NA,
                            var_label=c("Arteries","Veins"),
                            row_type="label",
                            label=c("Arteries","Veins"),
                            stat_0= NA)
                          ) |>
       arrange(factor(variable, 
                      levels=c("Arteries",
                               "Common carotid",
                               "Internal carotid",
                               "Venous",
                               "Superficial temporal",
                               "Sinoatrial",
                               "Ophthalmic",
                               "Veins")))) |>
      modify_column_indent(columns=label,
                     rows=!variable %in% c("Arteries","Veins"))
    

    enter image description here