r quanteda read-text

Import texts and docvars from an XML file with the readtext package


I'm trying to import texts from XML files with the readtext package in order to create and explore a corpus with quanteda. From the help page I've figured out how to import the texts, but I'd like to know whether docvars can be created from node attributes in the XML files.

Consider this XML file:

<corpus>
  <text author="Bill" date="1928-05-27">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed non risus. Suspendisse lectus tortor, dignissim sit amet, adipiscing nec, ultricies sed, dolor. Cras elementum ultrices diam. Maecenas ligula massa, varius a, semper congue, euismod non, mi. Proin porttitor, orci nec nonummy molestie, enim est eleifend mi, non fermentum diam nisl sit amet erat. Duis semper. Duis arcu massa, scelerisque vitae, consequat in, pretium a, enim. Pellentesque congue. Ut in risus volutpat libero pharetra tempor. Cras vestibulum bibendum augue.
  </text>
</corpus>

You can import the text node's content as the text field using an XPath expression:

library(readtext)
texts <- readtext("file.xml", text_field = ".//text", encoding = "utf-8", verbosity = 3)

But I don't know whether node attributes (author and date in this case) can be imported as docvars.

If so, any help achieving that would be much appreciated!


Solution

  • readtext() itself doesn't seem to support it, but assuming there's a single corpus (with a single <text> node) per file, you could extract the attributes with xml2 and then add them to the readtext object:

    library(readtext)
    library(xml2)
    library(dplyr)
    
    ## for a single file:
    texts <- readtext("file.xml", text_field = ".//text", encoding = "utf-8", verbosity = 3)
    #> Reading texts from file.xml
    #> , using glob pattern
    #>  ... reading (xml) file: file.xml
    #>  ... read 1 document
    ## take the attributes of the first <text> node and append them as columns
    read_xml("file.xml") %>% 
      xml_find_first(".//text") %>% 
      xml_attrs() %>% 
      as.list() %>% 
      bind_cols(texts, .)
    #> readtext object consisting of 1 document and 2 docvars.
    #> # Description: df [1 × 4]
    #>   doc_id   text                 author date      
    #>   <chr>    <chr>                <chr>  <chr>     
    #> 1 file.xml "\"\nLorem ips\"..." Bill   1928-05-27
    
    ## for a list of files:
    ## (the \(x) lambda needs R >= 4.1, list_rbind() needs purrr >= 1.0.0)
    library(purrr)
    list.files(pattern = "file.*\\.xml") %>% 
      map(\(x) 
          bind_cols(
            readtext(x, text_field = ".//text", encoding = "utf-8"),
            read_xml(x) %>% xml_find_first(".//text") %>% xml_attrs() %>% as.list()
          )
      ) %>% 
      list_rbind()
    #> readtext object consisting of 2 documents and 2 docvars.
    #> # Description: df [2 × 4]
    #>   doc_id    text                 author date      
    #>   <chr>     <chr>                <chr>  <chr>     
    #> 1 file.xml  "\"\nLorem ips\"..." Bill   1928-05-27
    #> 2 file2.xml "\"\nSed non r\"..." Gill   1998-05-27
    

    Created on 2023-03-16 with reprex v2.0.2
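
The approach above takes the attributes of the first <text> node only. If a file may hold several <text> nodes, one option is to skip readtext() for that step and build the table with xml2 alone. The following is a minimal sketch under that assumption, not a tested replacement for the code above: the inline XML string, the second node (Gill) and the doc_id naming scheme are made up for illustration.

```r
library(xml2)

# Hypothetical inline document with two <text> nodes, standing in for file.xml
doc <- read_xml('<corpus>
  <text author="Bill" date="1928-05-27">Lorem ipsum dolor sit amet.</text>
  <text author="Gill" date="1998-05-27">Sed non risus.</text>
</corpus>')

# Select every <text> node, not just the first one
nodes <- xml_find_all(doc, ".//text")

# One row per node: its content plus its attributes as columns
texts_df <- data.frame(
  doc_id = paste0("file.xml.", seq_along(nodes)),  # illustrative id scheme
  text   = xml_text(nodes, trim = TRUE),
  author = xml_attr(nodes, "author"),
  date   = as.Date(xml_attr(nodes, "date")),       # parse ISO dates instead of keeping <chr>
  stringsAsFactors = FALSE
)

texts_df
```

A data frame shaped like this should be usable with quanteda::corpus(texts_df, text_field = "text"), in which case the remaining columns (author, date) would become docvars.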