I have a very large XML file (around 45 GB) that I am trying to parse into a data frame. The XML has a fairly simple structure, shown below. I want to read the attributes of each <event> tag whose type is either "entered link" or "left link".
<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="10800.0" type="actend" person="9982471" link="21225" actType="home" />
<event time="10800.0" type="departure" person="9982471" link="21225" legMode="car" />
<event time="10800.0" type="PersonEntersVehicle" person="9982471" vehicle="9982471" />
<event time="10800.0" type="actend" person="9656271" link="21066" actType="home" />
<event time="10833.0" type="entered link" person="4250461" link="24329" vehicle="4250461" />
<event time="10835.0" type="left link" person="1662941" link="29242" vehicle="1662941" />
<event time="10835.0" type="entered link" person="1662941" link="29239" vehicle="1662941" />
<event time="10836.0" type="left link" person="7651702" link="7359" vehicle="7651702" />
<event time="10836.0" type="entered link" person="7651702" link="7407" vehicle="7651702" />
<event time="10840.0" type="left link" person="8909152" link="5664" vehicle="8909152" />
</events>
I have tried the DOM-based xmlParse() function, but it fails due to memory issues. I then tried the SAX-based code shown below, but it is far too slow: reading a 1% sample and building a data frame from it took about 5 hours, so the full data would take roughly 20 days (assuming it scales linearly). Can you please help me solve this issue? Here are the links to a very small sample, a 1% sample, a 5% sample, and the full data.
Here is the SAX code I used.
library(XML)

branchFunction <- function() {
  # environment used to accumulate the attribute sets of matching events
  store <- new.env()

  # closure returning 1, 2, 3, ... so each stored branch gets a unique key
  new_counter <- (function() {
    i <- 0
    function() {
      i <<- i + 1
      i
    }
  })()

  # called for every <event> branch; keeps only "entered link" / "left link" events
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//event[@type='left link' or @type='entered link']")
    value <- lapply(ns, xmlAttrs)
    store[[toString(new_counter())]] <- value
  }

  getStore <- function() { as.list(store) }

  list(event = func, getStore = getStore)
}

myfunctions <- branchFunction()

xmlEventParse(file = "percent1.gz", handlers = NULL, branches = myfunctions)

# flatten the stored attribute vectors into a data frame
l <- myfunctions$getStore()
l <- unlist(l, recursive = FALSE)
df <- data.frame(matrix(unlist(l), nrow = length(l), byrow = TRUE),
                 stringsAsFactors = FALSE)
colnames(df) <- c("time", "type", "person", "link", "carid")
The output should look like this:
> head(df)
     time         type  person  link   carid
1 10934.0 entered link 9656271 16260 9656271
2 10935.0    left link 8909152  6014 8909152
3 10935.0 entered link 8909152  6034 8909152
4 10936.0    left link 1504062 25541 1504062
5 10936.0 entered link 1504062 25384 1504062
6 10936.0    left link 3055801 31464 3055801
Using Saxon, the following XSLT 3.0 stylesheet processed your percent1 sample (413 MB) in 14.5 seconds:
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    expand-text="yes">

  <xsl:template name="xsl:initial-template">
    <xsl:stream href="percent1.xml">
      <xsl:for-each select="/events/event[@type=('entered link', 'left link')]">
        {position()} {@time} {@person} {@link} {@vehicle}
      </xsl:for-each>
    </xsl:stream>
  </xsl:template>

</xsl:stylesheet>
Memory usage was 13 MB (which isn't going to increase as the file size increases). The extrapolated time for the full dataset is about 25 minutes.
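In case it's useful, here is a minimal sketch of driving that transform from R and reading its plain-text output back into a data frame. The file names (extract.xsl, events.txt) and the Saxon jar path are assumptions, not part of the timing test above; also note that xsl:stream needs a streaming processor, i.e. Saxon-EE rather than the open-source Saxon-HE.

# Sketch only: jar path, stylesheet name, and output file are assumptions.
# "-it" with no template name invokes the template named xsl:initial-template.
system2("java", c("-cp", "saxon-ee.jar", "net.sf.saxon.Transform",
                  "-it", "-xsl:extract.xsl", "-o:events.txt"))

# The stylesheet writes whitespace-separated records:
# position, time, person, link, vehicle.
v  <- scan("events.txt", what = character(), quiet = TRUE)
m  <- matrix(v, ncol = 5, byrow = TRUE)
df <- data.frame(time   = m[, 2],
                 person = m[, 3],
                 link   = m[, 4],
                 carid  = m[, 5],
                 stringsAsFactors = FALSE)
head(df)

The stylesheet as written doesn't emit the type attribute; adding {@type} to the text value template (and a matching column here) would reproduce the exact data frame shown in the question.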