
How to get formatting of styles in Word document in officer?


I want to write some documents with officer, and I have some predefined styles in a Word document that I load with read_docx(). I can list the styles, but I specifically want to know which font and which font size each style uses, and I cannot find that. This is all I can find:

Document <- read_docx(FILEPATH)
head(Document$styles)
  style_type style_id style_name is_custom is_default
1  paragraph   Normal     Normal     FALSE       TRUE
2  paragraph Heading1  heading 1     FALSE      FALSE
3  paragraph Heading2  heading 2     FALSE      FALSE
4  paragraph Heading3  heading 3     FALSE      FALSE
5  paragraph Heading4  heading 4     FALSE      FALSE
6  paragraph Heading5  heading 5     FALSE      FALSE

Unfortunately there is no column with the font size or font type. I really need to have the font size (for example 10) and font type (for example "Times New Roman") of heading 1 in R because the argument style of the function body_add_par is not enough for my purposes. Is there a way to get this?

Edit: A solution that does not rely on officer would also be welcome.


Solution

  • I couldn't find a way to do this in officer. In fact, in the end I had to parse the xml contents of the docx to get the fonts.

    It turns out that not all styles have a font set. Some inherit from other styles, and some just take the default value given by Word. Parsing the xml is fairly involved, so this is a bit messy.

    First you need to unzip the docx to get its style xml. If you have officer you will also have the required zip package, so we'll use this:

    library(zip)
    doc_path <- "my_file_path.docx"
    unzip(doc_path, files = "word/styles.xml", exdir = path.expand("~/"))
    

    Now we need to parse the xml:


    Edit

    As pointed out in the comments by @TobiSonne, the sz values are in half-points, not points, so we need to halve them to get the fonts' point sizes.


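    library(xml2)      # read_xml(), xml_find_all(), xml_find_first(), xml_attr()
    library(magrittr)  # %>%

    # One row per <w:style>: name, parent style, ascii font, size in points
    read_xml(path.expand("~/word/styles.xml")) %>%
    xml_find_all(xpath = "//w:style")          %>%
    lapply(xml_new_root)                       %>%
    lapply(function(x) data.frame(
      name = x %>% xml_find_first(xpath = "//w:name") %>% xml_attr("val"),
      based_on = x %>% xml_find_first(xpath = "//w:basedOn") %>% xml_attr("val"),
      font = x %>% xml_find_first(xpath = "//w:rFonts") %>% xml_attr("ascii"),
      # w:sz is stored in half-points, so divide by 2
      size = x %>% xml_find_first(xpath = "//w:sz") %>% xml_attr("val") %>% as.numeric() %>% `/`(2),
      stringsAsFactors = FALSE)) %>%
    {do.call("rbind", .)} -> font_table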
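
To see what this extraction does on a small input, here is a minimal sketch that runs the same kind of lookup on an inline XML string instead of a real styles.xml. The XML content, the `toy_table` name and the extra `style_id` column are made up for illustration. It also uses relative XPath on the node set rather than `xml_new_root`, which is just an alternative way to write the same lookup.

```r
library(xml2)

# A tiny stand-in for word/styles.xml (made-up content for illustration)
styles_xml <- '<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:style w:type="paragraph" w:styleId="Normal">
    <w:name w:val="Normal"/>
    <w:rPr><w:rFonts w:ascii="Times New Roman"/><w:sz w:val="24"/></w:rPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:rPr><w:sz w:val="32"/></w:rPr>
  </w:style>
</w:styles>'

doc    <- read_xml(styles_xml)
ns     <- xml_ns(doc)                       # picks up the "w" prefix
styles <- xml_find_all(doc, "//w:style", ns)

# xml_find_first() on a node set returns one (possibly missing) node per
# style, and xml_attr() yields NA for the missing ones
toy_table <- data.frame(
  style_id = xml_attr(styles, "styleId"),
  name     = xml_attr(xml_find_first(styles, "./w:name", ns), "val"),
  based_on = xml_attr(xml_find_first(styles, "./w:basedOn", ns), "val"),
  font     = xml_attr(xml_find_first(styles, "./w:rPr/w:rFonts", ns), "ascii"),
  # w:sz is in half-points: "24" means 12 pt
  size     = as.numeric(xml_attr(xml_find_first(styles, "./w:rPr/w:sz", ns), "val")) / 2,
  stringsAsFactors = FALSE
)

toy_table
```

Here "heading 1" has no font of its own, so its font is NA and would have to come from its parent style, Normal.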
    
    

    This gives us the font table, but it has lots of missing values that have to be inferred from the document defaults, from inheritance, etc.:

    
    # requires library(xml2) and library(magrittr)
    read_xml(path.expand("~/word/styles.xml")) %>%
    xml_find_first(xpath = "//w:docDefaults//w:rPr") %>% 
    xml_new_root -> defaults
    
    # Document-wide default size (half-points converted to points)
    default_size <- xml_find_first(defaults, xpath = "//w:sz") %>% 
                    xml_attr("val") %>%
                    as.numeric() %>%
                    `/`(2)
    # Default font: fall back to the theme reference if no explicit font is set
    default_font <- xml_find_first(defaults, xpath = "//w:rFonts") %>% xml_attr("ascii")
    if (is.na(default_font))
      default_font <- xml_find_first(defaults, xpath = "//w:rFonts") %>% xml_attr("asciiTheme")
    
    # Styles with no parent take the default size; missing fonts take the default font
    font_table$size[is.na(font_table$size) & is.na(font_table$based_on)] <- default_size
    font_table$font[is.na(font_table$font)] <- default_font
    font_table$based_on[is.na(font_table$based_on)] <- "default"
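
When the default comes from asciiTheme, the value is a theme key such as minorHAnsi rather than a font name. Below is a rough sketch of how such a key could be turned into a typeface. It assumes the font scheme lives in the docx under word/theme/theme1.xml (the usual location, but not checked here), and it uses a made-up inline XML string so that it runs on its own. `resolve_theme_font` is a hypothetical helper, not part of officer or xml2.

```r
library(xml2)

# Made-up stand-in for word/theme/theme1.xml, trimmed to the font scheme
theme_xml <- '<a:theme xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <a:themeElements>
    <a:fontScheme name="Office">
      <a:majorFont><a:latin typeface="Calibri Light"/></a:majorFont>
      <a:minorFont><a:latin typeface="Calibri"/></a:minorFont>
    </a:fontScheme>
  </a:themeElements>
</a:theme>'

theme <- read_xml(theme_xml)
ns    <- xml_ns(theme)

# Hypothetical helper: map a theme key such as "minorHAnsi" to a typeface.
# Keys starting with "major" use a:majorFont, all others a:minorFont; only
# the Latin typeface is looked up here.
resolve_theme_font <- function(key, theme, ns) {
  slot <- if (startsWith(key, "major")) "majorFont" else "minorFont"
  node <- xml_find_first(theme, paste0("//a:fontScheme/a:", slot, "/a:latin"), ns)
  xml_attr(node, "typeface")
}

resolve_theme_font("minorHAnsi", theme, ns)
```

For a real document, the theme file would first have to be unzipped in the same way as styles.xml.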
    

    Now we have:

    font_table
    #>                      name             based_on            font size
    #> 1                  Normal              default      minorHAnsi   12
    #> 2               heading 2               Normal      minorHAnsi   13
    #> 3  Default Paragraph Font              default      minorHAnsi   12
    #> 4            Normal Table              default      minorHAnsi   12
    #> 5                 No List              default      minorHAnsi   12
    #> 6              Table Grid          TableNormal      minorHAnsi <NA>
    #> 7          List Paragraph               Normal      minorHAnsi <NA>
    #> 8            Normal (Web)               Normal Times New Roman <NA>
    #> 9            Balloon Text               Normal          Tahoma    8
    #> 10      Balloon Text Char DefaultParagraphFont          Tahoma    8
    #> 11         Heading 2 Char DefaultParagraphFont      minorHAnsi   13
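
The remaining NA sizes belong to styles that inherit from a parent. Here is a minimal sketch of one way to fill them in by walking the based_on chain. Note that based_on holds style ids (e.g. TableNormal) while name holds display names (e.g. Normal Table), so the sketch assumes an extra `style_id` column taken from the w:styleId attribute, which the table above does not record. The toy data and the `resolve_size` helper are hypothetical, and Word's real resolution rules (paragraph versus character styles, document defaults) are more involved than this.

```r
# Toy table shaped like font_table, plus a style_id column (an assumption:
# the table built above does not record w:styleId)
toy <- data.frame(
  style_id = c("Normal", "Heading2", "ListParagraph", "BalloonText"),
  name     = c("Normal", "heading 2", "List Paragraph", "Balloon Text"),
  based_on = c("default", "Normal", "Normal", "Normal"),
  size     = c(12, 13, NA, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow based_on from one style until a size is found
resolve_size <- function(id, tbl, max_depth = 20) {
  for (i in seq_len(max_depth)) {          # guard against circular references
    row <- tbl[which(tbl$style_id == id), ]
    if (nrow(row) != 1) return(NA_real_)   # unknown parent, e.g. "default"
    if (!is.na(row$size)) return(row$size)
    id <- row$based_on                     # move one step up the chain
  }
  NA_real_
}

toy$size <- vapply(toy$style_id, resolve_size, numeric(1), tbl = toy,
                   USE.NAMES = FALSE)
toy
```

In this toy table, List Paragraph has no size of its own and picks up 12 from Normal, while styles that set their own size keep it.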