Building on this previous question (see here), I am attempting to read in many, large xml files via xmlEventParse whilst saving node-varying data. Working with this sample xml: https://www.nlm.nih.gov/databases/dtd/medsamp2015.xml.
The code below uses xpathSapply to extract the necessary values and a series of if statements to combine the values in a way that matches the unique value (PMID) to each of the non-unique values (LastName) within a record - for which there may be no LastNames. The goal is to write a series of small csv's along the way (here, after every 1000 LastNames) to minimize the amount of memory used.
When run on the full-sized data set, the code successfully outputs files in batches, however something is still being stored in memory that eventually causes a system error once all RAM is used. I've watched the task manager while the code runs and can see R's memory grow as the program progresses. And if I stop the program mid-run and then clear the R workspace, including hidden items, the memory still appears to be in use by R. It is not until I shutdown R is the memory freed up again.
Run this a few times yourself and you'll see R's memory usage grow even after clearing the workspace.
Please help! This problem appears to be common to others reading in large XML files in this manner (See for example comments in this question).
My code is as follows:
library(XML)
filename <- "~/Desktop/medsamp2015.xml"
tempdat <- data.frame(pmid=as.numeric(),
lname=character(),
stringsAsFactors=FALSE)
cnt <- 1
branchFunction <- function() {
func <- function(x, ...) {
v1 <- xpathSApply(x, path = "//PMID", xmlValue)
v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
print(cbind(c(rep(v1,length(v2))), v2))
#below is where I store/write the temp data along the way
#but even without doing this, memory is used (even after clearing)
tempdat <<- rbind(tempdat,cbind(c(rep(v1,length(v2))), v2))
if (nrow(tempdat) > 1000){
outname <- paste0("~/Desktop/outfiles",cnt,".csv")
write.csv(tempdat, outname , row.names = F)
tempdat <<- data.frame(pmid=as.numeric(),
lname=character(),
stringsAsFactors=FALSE)
cnt <<- cnt+1
}
}
list(MedlineCitation = func)
}
myfunctions <- branchFunction()
#RUN
xmlEventParse(
file = filename,
handlers = NULL,
branches = myfunctions
)
Here is an example, we have a launch script invoke.sh
, that calls an R Script and passes the url and filename as parameters... In this case, I had previously downloaded the test file medsamp2015.xml and put in the ./data
directory.
invoke.sh
script and iterate through the list of target file names. For each file you invoke an R instance, download it, process the file and move on to the next. Caveat: I didn't check or change your function against any other download files and formats. I would turn off the printing of the output by removing the print() wrapper on line 62.
print( cbind(c(rep(v1, length(v2))), v2))
.csv
files are placed in the ./data
directory.Note: This is a derivative of a previous answer provided by me on this subject: R memory not released in Windows. I hope it helps by way of example.
1 #!/usr/local/bin/bash -x
2
3 R --no-save -q --slave < ./47162861.R --args "https://www.nlm.nih.gov/databases/dtd" "medsamp2015.xml"
47162861.R
# Set working directory
projectDir <- "~/dev/stackoverflow/47162861"
setwd(projectDir)
# -----------------------------------------------------------------------------
# Load required Packages...
requiredPackages <- c("XML")
ipak <- function(pkg) {
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
ipak(requiredPackages)
# -----------------------------------------------------------------------------
# Load required Files
# trailingOnly=TRUE means that only your arguments are returned
args <- commandArgs(trailingOnly = TRUE)
if ( length(args) != 0 ) {
dataDir <- file.path(projectDir,"data")
fileUrl = args[1]
fileName = args[2]
} else {
dataDir <- file.path(projectDir,"data")
fileUrl <- "https://www.nlm.nih.gov/databases/dtd"
fileName <- "medsamp2015.xml"
}
# -----------------------------------------------------------------------------
# Download file
# Does the directory Exist? If it does'nt create it
if (!file.exists(dataDir)) {
dir.create(dataDir)
}
# Now we check if we have downloaded the data already if not we download it
if (!file.exists(file.path(dataDir, fileName))) {
download.file(fileUrl, file.path(dataDir, fileName), method = "wget")
}
# -----------------------------------------------------------------------------
# Now we extrat the data
tempdat <- data.frame(pmid = as.numeric(), lname = character(),
stringsAsFactors = FALSE)
cnt <- 1
branchFunction <- function() {
func <- function(x, ...) {
v1 <- xpathSApply(x, path = "//PMID", xmlValue)
v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
print(cbind(c(rep(v1, length(v2))), v2))
# below is where I store/write the temp data along the way
# but even without doing this, memory is used (even after
# clearing)
tempdat <<- rbind(tempdat, cbind(c(rep(v1, length(v2))),
v2))
if (nrow(tempdat) > 1000) {
outname <- file.path(dataDir, paste0(cnt, ".csv")) # Create FileName
write.csv(tempdat, outname, row.names = F) # Write File to created directory
tempdat <<- data.frame(pmid = as.numeric(), lname = character(),
stringsAsFactors = FALSE)
cnt <<- cnt + 1
}
}
list(MedlineCitation = func)
}
myfunctions <- branchFunction()
# -----------------------------------------------------------------------------
# RUN
xmlEventParse(file = file.path(dataDir, fileName),
handlers = NULL,
branches = myfunctions)
~/dev/stackoverflow/47162861/data/medsamp2015.xml
$ ll
total 2128
drwxr-xr-x@ 7 hidden staff 238B Nov 10 11:05 .
drwxr-xr-x@ 9 hidden staff 306B Nov 10 11:11 ..
-rw-r--r--@ 1 hidden staff 32K Nov 10 11:12 1.csv
-rw-r--r--@ 1 hidden staff 20K Nov 10 11:12 2.csv
-rw-r--r--@ 1 hidden staff 23K Nov 10 11:12 3.csv
-rw-r--r--@ 1 hidden staff 37K Nov 10 11:12 4.csv
-rw-r--r--@ 1 hidden staff 942K Nov 10 11:05 medsamp2015.xml
> ./invoke.sh > runtime.txt
+ R --no-save -q --slave --args https://www.nlm.nih.gov/databases/dtd medsamp2015.xml
Loading required package: XML
File: runtime.txt