Read and parse an XML file in chunks in R


I'm trying to read and process a ~5.8 GB .xml file from Wikipedia Dumps using R. I don't have much RAM, so I would like to process it in chunks. (Currently, calling xml2::read_xml on the whole file freezes my computer completely.)

The file contains one XML element for each Wikipedia page, like this:

<page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>631144794</id>
      <parentid>381202555</parentid>
      <timestamp>2014-10-26T04:50:23Z</timestamp>
      <contributor>
        <username>Paine Ellsworth</username>
        <id>9092818</id>
      </contributor>
      <comment>add [[WP:RCAT|rcat]]s</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{Redr|move|from CamelCase|up}}</text>
      <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
    </revision>
</page>

A sample of the file can be found here

I think it should be possible to read the file in chunks, for example one page at a time, and save each processed page element as a line in a .csv file.

I would like to end up with a data.frame with the following columns: id, title and text.

How can I read this .xml file in chunks?


Solution

  • This can be improved, but the main idea is here. You still need to decide how many lines to read in each iteration of readLines() and how to process each chunk, but this is a way to get the chunks:

    xml <- readLines("ptwiki-20161101-pages-articles.xml", n = 2000)
    
    # Line numbers where each page starts (inicio) and ends (fim)
    inicio <- grep(pattern = "<page>", x = xml, fixed = TRUE)
    fim <- grep(pattern = "</page>", x = xml, fixed = TRUE)
    if (length(inicio) > length(fim)) { # more starts than ends: the last page is cut off
      inicio <- inicio[-length(inicio)] # drop the incomplete one
    }
    
    chunks <- vector("list", length(inicio))
    
    for (i in seq_along(chunks)) {
      chunks[[i]] <- xml[inicio[i]:fim[i]]
    }
    
    # Collapse each page into one string; "\n" keeps the line breaks inside <text>
    chunks <- sapply(chunks, paste, collapse = "\n")
    

    I tried read_xml(chunks[1]) %>% xml_nodes("text") %>% xml_text() and it worked. (xml_nodes() comes from rvest; with xml2 alone, xml_find_all(".//text") does the same job.)
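
The idea above can be extended into a loop that walks the whole file through an open connection, so only a few thousand lines are in memory at a time. The following is a minimal sketch, not a tuned implementation. The names parse_page, process_dump, out_csv and n_lines are made up for illustration, and it assumes that every <page> and </page> tag sits on its own line, as in the sample shown in the question. Lines belonging to a page that is cut off at the end of a read are kept in a buffer and completed on the next read.

```r
library(xml2)

# Hypothetical helper: turn one "<page>...</page>" string into a one-row data.frame.
parse_page <- function(page_xml) {
  doc <- read_xml(page_xml)
  data.frame(
    id    = xml_text(xml_find_first(doc, "/page/id")),
    title = xml_text(xml_find_first(doc, "/page/title")),
    text  = xml_text(xml_find_first(doc, "/page/revision/text")),
    stringsAsFactors = FALSE
  )
}

# Hypothetical driver: read `path` n_lines at a time and append rows to `out_csv`.
process_dump <- function(path, out_csv, n_lines = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))

  buffer <- character(0)  # lines of a page whose </page> has not been read yet
  first_write <- TRUE
  n_pages <- 0

  repeat {
    lines <- readLines(con, n = n_lines, warn = FALSE, encoding = "UTF-8")
    if (length(lines) == 0) break
    buffer <- c(buffer, lines)

    starts <- grep("<page>", buffer, fixed = TRUE)
    ends <- grep("</page>", buffer, fixed = TRUE)
    if (length(starts) > 0) ends <- ends[ends > starts[1]]
    n_complete <- min(length(starts), length(ends))

    if (n_complete > 0) {
      pages <- vapply(seq_len(n_complete), function(i) {
        paste(buffer[starts[i]:ends[i]], collapse = "\n")
      }, character(1))
      rows <- do.call(rbind, lapply(pages, parse_page))
      # write.csv() cannot append, so write.table() is used instead
      write.table(rows, out_csv, sep = ",", row.names = FALSE,
                  col.names = first_write, append = !first_write,
                  qmethod = "double")
      first_write <- FALSE
      n_pages <- n_pages + n_complete
    }

    # Keep only the lines of an unfinished page; drop everything else
    if (length(starts) > n_complete) {
      buffer <- buffer[starts[n_complete + 1]:length(buffer)]
    } else {
      buffer <- character(0)
    }
  }
  invisible(n_pages)
}
```

A call such as process_dump("ptwiki-20161101-pages-articles.xml", "pages.csv") would then write one CSV row per page. Larger values of n_lines mean fewer grep() passes at the cost of more memory per iteration, so the value is worth tuning for your machine.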