r nlp docx

Column header as row value


I am extracting tables from Word documents using the docxtractr package, but one of my tables is not turning out well.

After extraction, it looks like

Column A    Column A value
            Column B value
Column C    Column C value

and I want it to look like

Column A          Column B          Column C
Column A value    Column B value    Column C value

Is there a way to reshape the first table into the second?

Or perhaps a better way of extracting the values/tables from the word document?

TIA

I'm still looking for solutions.


Solution

  • If d is your table ...

    ## create example table:
    d <- structure(list(Var1 = c("Column A", NA, "Column C"), Var2 = c("Column A value", 
    "Column B value", "Column C value")), class = "data.frame", row.names = c(NA, 
    3L))
    
    > d
          Var1           Var2
    1 Column A Column A value
    2     <NA> Column B value
    3 Column C Column C value
    

    ... you can use {dplyr} and {tidyr} to substitute missing column names and reshape to wide format like this:

    library(dplyr)
    library(tidyr)
    
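    d |>
      # give rows with a missing name a placeholder such as "Column_2"
      mutate(Var1 = ifelse(is.na(Var1), paste0('Column_', row_number()), Var1)) |>
      # turn each Var1 entry into a column holding the matching Var2 value
      pivot_wider(names_from = Var1, values_from = Var2)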
    
      `Column A`     Column_2       `Column C`    
      <chr>          <chr>          <chr>         
    1 Column A value Column B value Column C value
    

    You might need to set the header argument to FALSE upon import, so that the first row is kept as data instead of being used as column names: docx_extract_tbl(..., header = FALSE, ...)
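
If several tables need the same treatment, the two steps above could be wrapped in a small helper. This is only a sketch: `widen_label_value()` and its `prefix` argument are made-up names, not part of docxtractr, and it assumes the imported table has the labels in its first column and the values in its second. It also treats blank labels like missing ones, on the assumption that an empty cell may come through as "" rather than NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of docxtractr): reshape a two-column
# label/value table into a single wide row.
widen_label_value <- function(tbl, prefix = "Column_") {
  # Use the first two columns by position, whatever they are called
  out <- data.frame(label = as.character(tbl[[1]]),
                    value = as.character(tbl[[2]]),
                    stringsAsFactors = FALSE)
  out |>
    # Assumption: a blank label should be handled like a missing one
    mutate(label = ifelse(is.na(label) | trimws(label) == "",
                          paste0(prefix, row_number()), label)) |>
    pivot_wider(names_from = label, values_from = value)
}

# Same example table as above
d <- data.frame(Var1 = c("Column A", NA, "Column C"),
                Var2 = c("Column A value", "Column B value", "Column C value"))
wide <- widen_label_value(d)

# With a real document it might be called like this ("report.docx" and
# the table number are placeholders):
# library(docxtractr)
# doc <- read_docx("report.docx")
# raw <- docx_extract_tbl(doc, tbl_number = 1, header = FALSE)
# widen_label_value(raw)
```

Note that pivot_wider() expects the labels to be unique; if the same label appears twice, it warns and returns list-columns, so such tables would need extra handling.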