Tags: r, xml, memory-management, xml-parsing, large-files

R: Memory Management during xmlEventParse of Huge (>20GB) files


Building on this previous question (see here), I am attempting to read in many large XML files via xmlEventParse while saving node-varying data. I am working with this sample XML: https://www.nlm.nih.gov/databases/dtd/medsamp2015.xml.

The code below uses xpathSApply to extract the necessary values and a series of if statements to combine them, matching the unique value (PMID) in each record to each of that record's non-unique values (LastName); a record may contain no LastNames at all. The goal is to write a series of small CSVs along the way (here, after every 1000 LastNames) to minimize the amount of memory used.
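As a toy illustration of that pairing (the values here are made up), rep() and cbind() turn one PMID plus a vector of LastNames into one row per author:

v1 <- "123456"                  # the record's unique PMID
v2 <- c("Smith", "Jones")       # the record's LastNames (possibly empty)
cbind(rep(v1, length(v2)), v2)  # two rows: (123456, Smith) and (123456, Jones)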

When run on the full-sized data set, the code successfully outputs files in batches, but something is still being stored in memory that eventually causes a system error once all RAM is used. I've watched the task manager while the code runs and can see R's memory use grow as the program progresses. Even if I stop the program mid-run and then clear the R workspace, including hidden items, the memory still appears to be held by R; it is not freed until I shut R down entirely.

Run this a few times yourself and you'll see R's memory usage grow even after clearing the workspace.

Please help! This problem appears to be common among others reading in large XML files in this manner (see, for example, the comments on this question).

My code is as follows:

library(XML)

filename <- "~/Desktop/medsamp2015.xml"

tempdat <- data.frame(pmid=as.numeric(),
                      lname=character(), 
                      stringsAsFactors=FALSE) 
cnt <- 1
branchFunction <- function() {
  func <- function(x, ...) {
    v1 <- xpathSApply(x, path = "//PMID", xmlValue)
    v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
    print(cbind(c(rep(v1,length(v2))), v2))

    #below is where I store/write the temp data along the way
    #but even without doing this, memory is used (even after clearing)

    tempdat <<- rbind(tempdat,cbind(c(rep(v1,length(v2))), v2))
    if (nrow(tempdat) > 1000){
      outname <- paste0("~/Desktop/outfiles",cnt,".csv")
      write.csv(tempdat, outname , row.names = F)
      tempdat <<- data.frame(pmid=as.numeric(),
                            lname=character(), 
                            stringsAsFactors=FALSE)
      cnt <<- cnt+1
    }
  }
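  # xmlEventParse will call func once for each complete <MedlineCitation> branch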
  list(MedlineCitation = func)
}

myfunctions <- branchFunction()

#RUN
xmlEventParse(
  file = filename, 
  handlers = NULL, 
  branches = myfunctions
)
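
The workspace clearing I describe above is plain base R; even after running it, the OS still shows the memory as held by the R process:

rm(list = ls(all.names = TRUE))  # remove everything, including hidden objects
gc(verbose = TRUE)               # R-level memory is released, but the process footprint stays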

Solution

  • Here is an example: we have a launch script, invoke.sh, that calls an R script and passes the URL and filename as parameters. In this case, I had previously downloaded the test file medsamp2015.xml and put it in the ./data directory.

    • My sense would be to create a loop in the invoke.sh script and iterate through the list of target file names. For each file you invoke a fresh R instance, download the file, process it, and move on to the next; a sketch of such a loop appears after this list.

    Caveat: I didn't check or change your function against any other download files and formats. I would turn off printing of the output by removing the print() wrapper around the cbind() call in the branch function:

    print( cbind(c(rep(v1, length(v2))), v2))
    
    • See runtime.txt for the printed output.
    • The output .csv files are placed in the ./data directory.

    Note: This is a derivative of a previous answer of mine on this subject: R memory not released in Windows. I hope it helps by way of example.
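
    A sketch of that per-file loop, driven from R rather than the shell (the
    script name and file list below are illustrative assumptions): each
    system2() call starts a fresh R process via Rscript, so the operating
    system reclaims all of that process's memory when the child exits.

    # Hypothetical driver: one fresh R process per input file.
    baseUrl <- "https://www.nlm.nih.gov/databases/dtd"
    files <- c("medsamp2015.xml")  # extend with the real list of target files
    for (f in files) {
      system2("Rscript", args = c("47162861.R", baseUrl, f))
    }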

    Launch Script

      #!/usr/local/bin/bash -x

      R --no-save -q --slave < ./47162861.R --args "https://www.nlm.nih.gov/databases/dtd" "medsamp2015.xml"
    

    R File - 47162861.R

    # Set working directory
    
    projectDir <- "~/dev/stackoverflow/47162861"
    setwd(projectDir)
    
    # -----------------------------------------------------------------------------
    # Load required Packages...
    requiredPackages <- c("XML")
    
    ipak <- function(pkg) {
      new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
      if (length(new.pkg))
        install.packages(new.pkg, dependencies = TRUE)
      sapply(pkg, require, character.only = TRUE)
    }
    
    ipak(requiredPackages)
    
    # -----------------------------------------------------------------------------
    # Load required Files
    # trailingOnly=TRUE means that only your arguments are returned
    args <- commandArgs(trailingOnly = TRUE)
    
    if ( length(args) != 0 ) {
      dataDir <- file.path(projectDir,"data")
      fileUrl <- args[1]
      fileName <- args[2]
    } else {
      dataDir <- file.path(projectDir,"data")
      fileUrl <- "https://www.nlm.nih.gov/databases/dtd"
      fileName <- "medsamp2015.xml"
    }
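    
    # With the launch script above, the two --args values arrive here in order:
    #   fileUrl  == "https://www.nlm.nih.gov/databases/dtd"
    #   fileName == "medsamp2015.xml"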
    
    # -----------------------------------------------------------------------------
    # Download file
    
    # Does the directory exist? If it doesn't, create it
    if (!file.exists(dataDir)) {
      dir.create(dataDir)
    }
    
    # Check whether we have already downloaded the data; if not, download it
    
    if (!file.exists(file.path(dataDir, fileName))) {
      # fileUrl is the base URL, so append the file name to form the full URL
      download.file(paste(fileUrl, fileName, sep = "/"),
        file.path(dataDir, fileName), method = "wget")
    }
    
    # -----------------------------------------------------------------------------
    # Now we extract the data
    
    tempdat <- data.frame(pmid = as.numeric(), lname = character(),
      stringsAsFactors = FALSE)
    cnt <- 1
    
    branchFunction <- function() {
      func <- function(x, ...) {
        v1 <- xpathSApply(x, path = "//PMID", xmlValue)
        v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
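        # Caveat from above: remove this print() on large runs; it only echoes progress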
        print(cbind(c(rep(v1, length(v2))), v2))
    
        # below is where I store/write the temp data along the way
        # but even without doing this, memory is used (even after
        # clearing)
    
        tempdat <<- rbind(tempdat, cbind(c(rep(v1, length(v2))),
          v2))
        if (nrow(tempdat) > 1000) {
          outname <- file.path(dataDir, paste0(cnt, ".csv")) # Create FileName
          write.csv(tempdat, outname, row.names = F) # Write File to created directory
          tempdat <<- data.frame(pmid = as.numeric(), lname = character(),
            stringsAsFactors = FALSE)
          cnt <<- cnt + 1
        }
      }
      list(MedlineCitation = func)
    }
    
    myfunctions <- branchFunction()
    
    # -----------------------------------------------------------------------------
    # RUN
    xmlEventParse(file = file.path(dataDir, fileName),
                  handlers = NULL,
                  branches = myfunctions)
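    
    # Optional cleanup after the run. This is a hedge, not a fix: in practice
    # the OS-level footprint is only fully returned when this R process exits,
    # which is exactly why the launch script runs each file in its own R instance.
    rm(myfunctions, tempdat)
    invisible(gc())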
    

    Test File and output

    ~/dev/stackoverflow/47162861/data/medsamp2015.xml

    $ ll                                                            
    total 2128
    drwxr-xr-x@ 7 hidden  staff   238B Nov 10 11:05 .
    drwxr-xr-x@ 9 hidden  staff   306B Nov 10 11:11 ..
    -rw-r--r--@ 1 hidden  staff    32K Nov 10 11:12 1.csv
    -rw-r--r--@ 1 hidden  staff    20K Nov 10 11:12 2.csv
    -rw-r--r--@ 1 hidden  staff    23K Nov 10 11:12 3.csv
    -rw-r--r--@ 1 hidden  staff    37K Nov 10 11:12 4.csv
    -rw-r--r--@ 1 hidden  staff   942K Nov 10 11:05 medsamp2015.xml
    

    Runtime Output

    > ./invoke.sh > runtime.txt
    + R --no-save -q --slave --args https://www.nlm.nih.gov/databases/dtd medsamp2015.xml
    Loading required package: XML
    

    File: runtime.txt