Is it possible to read the text of the headers and footers from a DOCX file with R?
I tried the officer package for R. Using the read_docx() and docx_summary() functions, I was able to get the text from the paragraphs and tables in the main body of the document, but docx_summary() did not extract the text from the page headers and footers. Is it stored somewhere in the object created by read_docx()? How can I get at the headers and footers?
To the best of my knowledge, the officer package has no convenience function to extract information about the headers or footers. But the information is available in the rdocx object returned by read_docx(). If you just want the header or footer text, you can do:
library(officer)
library(xml2)
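# fn is the path to the .docx file (see the example file below)
x <- read_docx(fn)
# Header text, one element per header part
lapply(x$headers, \(part) {
  part$get() |>
    xml_find_all("//w:hdr") |>
    xml_text()
})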
#> $header1.xml
#> [1] "First section header"
#>
#> $header2.xml
#> [1] "Second section header"
lapply(x$footers, \(part) {
  part$get() |>
    xml_find_all("//w:ftr") |>
    xml_text()
})
#> $footer1.xml
#> [1] "First section footer"
#>
#> $footer2.xml
#> [1] "Second section footer"
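If you want something closer to the per-paragraph output of docx_summary(), the same idea can be wrapped in a small helper. This is only a sketch: hf_summary() is a hypothetical name, not an officer function, and it assumes that the headers and footers elements of the rdocx object keep exposing their XML via $get() as used above. The demo file is built with the same officer functions as the example file in this answer.

```r
library(officer)
library(xml2)

# Hypothetical helper (not part of officer): returns one row per paragraph
# found in the header and footer parts of an rdocx object.
hf_summary <- function(doc) {
  parts <- c(doc$headers, doc$footers)
  rows <- lapply(names(parts), function(nm) {
    # Each part exposes its XML tree via $get(), as in the code above
    paras <- xml_find_all(parts[[nm]]$get(), "//w:p")
    data.frame(
      part = rep(nm, length(paras)),
      text = xml_text(paras),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Minimal demo document with one header and one footer
ps_demo <- prop_section(
  header_default = block_list(fpar(ftext("Demo header"))),
  footer_default = block_list(fpar(ftext("Demo footer")))
)
demo <- read_docx()
demo <- body_add_par(demo, value = "Body text")
demo <- body_end_block_section(demo, value = block_section(ps_demo))
demo_file <- tempfile(fileext = ".docx")
print(demo, target = demo_file)

# Re-read the saved file, then collect header and footer paragraphs
res <- hf_summary(read_docx(demo_file))
res
```

The part column holds the name of the XML part (for example header1.xml), so rows can be filtered to headers or footers with grepl() if needed.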
The example docx file was created with:
library(officer)
txt_lorem <- rep(
  "Purus lectus eros metus turpis mattis platea praesent sed. ",
  50
)
txt_lorem <- paste0(txt_lorem, collapse = "")
header_first <- block_list(fpar(ftext("First section header")))
footer_first <- block_list(fpar(ftext("First section footer")))
ps <- prop_section(
  header_default = header_first, footer_default = footer_first
)
x <- read_docx()
for (i in seq(3)) {
  x <- body_add_par(x, value = txt_lorem)
}
x <- body_end_block_section(
  x,
  value = block_section(ps)
)
for (i in seq(3)) {
  x <- body_add_par(x, value = txt_lorem)
}
header_second <- block_list(fpar(ftext("Second section header")))
footer_second <- block_list(fpar(ftext("Second section footer")))
ps <- prop_section(
  header_default = header_second, footer_default = footer_second
)
x <- body_end_block_section(
  x,
  value = block_section(ps)
)
# Write the document to a temporary file; fn is the path used by read_docx(fn) above
fn <- tempfile(fileext = ".docx")
print(x, target = fn)