What to do in R when both DOM and SAX parsers fail?


I have a very large XML file of around 45 GB that I am trying to parse and turn into a data frame. The XML has a fairly simple structure, as shown below. I want to read the attributes of each <event> tag whose type is either "entered link" or "left link".

<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
    <event time="10800.0" type="actend" person="9982471" link="21225" actType="home"  />
    <event time="10800.0" type="departure" person="9982471" link="21225" legMode="car"  />
    <event time="10800.0" type="PersonEntersVehicle" person="9982471" vehicle="9982471"  />
    <event time="10800.0" type="actend" person="9656271" link="21066" actType="home"  />
    <event time="10833.0" type="entered link" person="4250461" link="24329" vehicle="4250461"  />
    <event time="10835.0" type="left link" person="1662941" link="29242" vehicle="1662941"  />
    <event time="10835.0" type="entered link" person="1662941" link="29239" vehicle="1662941"  />
    <event time="10836.0" type="left link" person="7651702" link="7359" vehicle="7651702"  />
    <event time="10836.0" type="entered link" person="7651702" link="7407" vehicle="7651702"  />
    <event time="10840.0" type="left link" person="8909152" link="5664" vehicle="8909152"  />
</events>

I have tried the DOM-based xmlParse() function, but it is not usable here because of memory issues. Then I tried SAX-based code (shown below), but it is taking too long: reading a 1% sample and creating a data frame from it took about 5 hours, so the full data would take roughly 20 days (assuming it scales linearly). Can you please help me solve this issue? Here are the links to a very small sample, the 1% sample, the 5% sample, and the full data.
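
For reference, the DOM attempt was essentially the following (a reconstructed sketch, since the exact call is not shown above; the file name and the getNodeSet() step are assumptions that mirror the SAX code below). It loads the whole document into memory, which is what fails at 45 GB:

library(XML)

## Parse the entire document into an in-memory DOM tree
doc <- xmlParse("percent1.xml")

## Extract the attributes of every matching <event> node
ns <- getNodeSet(doc, "//event[@type='left link' or @type='entered link']")
df <- as.data.frame(do.call(rbind, lapply(ns, xmlAttrs)),
                    stringsAsFactors = FALSE)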

Here is the SAX code I used.

library(XML)

branchFunction <- function() {
  store <- new.env()
  # Counter closure: gives each batch of results a unique key in `store`
  new_counter <- (function() {
    i <- 0
    function() {
      i <<- i + 1
      i
    }
  })()
  # Branch handler: called once per complete <event> node;
  # keeps only the attributes of the two event types of interest
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//event[@type='left link' or @type='entered link']")
    value <- lapply(ns, xmlAttrs)
    store[[toString(new_counter())]] <- value
  }
  getStore <- function() { as.list(store) }
  list(event = func, getStore = getStore)
}

myfunctions <- branchFunction()

xmlEventParse(file = "percent1.gz", handlers = NULL, branches = myfunctions)
l <- myfunctions$getStore()
l <- unlist(l, recursive = FALSE)
df <- data.frame(matrix(unlist(l), nrow = length(l), byrow = TRUE),
                 stringsAsFactors = FALSE)
colnames(df) <- c("time", "type", "person", "link", "carid")

The output must look like this:

> head(df)
     time         type  person  link   carid
1 10934.0 entered link 9656271 16260 9656271
2 10935.0    left link 8909152  6014 8909152
3 10935.0 entered link 8909152  6034 8909152
4 10936.0    left link 1504062 25541 1504062
5 10936.0 entered link 1504062 25384 1504062
6 10936.0    left link 3055801 31464 3055801

Solution

  • Using Saxon, the following XSLT 3.0 stylesheet processed your percent1 sample (413 MB) in 14.5 seconds:

    <xsl:stylesheet version="3.0" 
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="xs"
      expand-text="yes">
    
      <xsl:template name="xsl:initial-template">
        <xsl:stream href="percent1.xml">
          <xsl:for-each select="/events/event[@type=('entered link', 'left link')]">
            {position()} {@time} {@person} {@link} {@person}
          </xsl:for-each>
        </xsl:stream>
      </xsl:template>
    
    </xsl:stylesheet>
    

    Memory usage was 13 MB (which will not increase as the file size increases). The extrapolated time for the full dataset is about 25 minutes.
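
  • If the goal is still an R data frame, one option (a sketch, not part of the measurements above) is to run Saxon from R and read its whitespace-separated output back in. The file names extract.xsl, saxon.jar, and events.txt are placeholders; streaming with xsl:stream requires Saxon-EE, and adding <xsl:output method="text"/> to the stylesheet keeps an XML declaration out of the output:

    ## Run the stylesheet via its named initial template; a bare -it selects
    ## xsl:initial-template, and -o: writes the result to a file
    system2("java",
            c("-cp", "saxon.jar", "net.sf.saxon.Transform",
              "-xsl:extract.xsl", "-it", "-o:events.txt"))

    ## Each output row is "position time person link person", whitespace-separated
    df <- read.table("events.txt",
                     col.names = c("n", "time", "person", "link", "carid"),
                     stringsAsFactors = FALSE)
    head(df)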