apache-spark, pyspark, databricks, apache-spark-xml

Use recursive globbing to extract XML documents as strings in pyspark


The goal is to extract XML documents, given an XPath expression, from a group of text files as strings. The difficulty is the variety of forms the text files may take. They might be:

  • a single zip / tar file with 100 files, each containing one XML document
  • one file with 100 XML documents (an aggregate document)
  • one zip / tar file, with varying levels of directories, containing both single-XML-record files and aggregate XML files

I thought I had found a solution with Databricks' Spark-XML library, as it handles recursive globbing when reading files. It was amazing. I could do things like:

# read a directory of loose files
df = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='mods:mods') \
    .load('file:///tmp/combine/qs/mods/*.xml')

# recursively discover and parse
df = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='mods:mods') \
    .load('file:///tmp/combine/qs/**/*.xml')

# even read archive files without additional work
df = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='mods:mods') \
    .load('file:///tmp/combine/mods_archive.tar')

The problem is that this library is focused on parsing the XML records into DataFrame columns, whereas my goal is to retrieve just the XML documents as strings for storage.
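
To make the mismatch concrete, inspecting one of the DataFrames above shows fully parsed columns rather than the original markup (a sketch; the schema in the comments is hypothetical and depends entirely on the source records):

# the DataFrame holds parsed columns; the original XML string is gone
df.printSchema()
# root
#  |-- mods:titleInfo: struct (nullable = true)
#  |    |-- mods:title: string (nullable = true)
#  |-- mods:role: struct (nullable = true)
#  ...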

My Scala is not strong enough to easily hack the Spark-XML library into keeping the recursive globbing and XPath grabbing of documents while skipping the parsing, and instead saving each entire XML record as a string.

The library comes with the ability to serialize DataFrames to XML, but the serialization is decidedly different from the input (which is to be expected to some degree). For example, element text values become element attributes. Given the following original XML:

<mods:role>
    <mods:roleTerm authority="marcrelator" type="text">creator</mods:roleTerm>
</mods:role>

reading and then serializing with Spark-XML returns:

<mods:role>
    <mods:roleTerm VALUE="creator" authority="marcrelator" type="text"></mods:roleTerm>
</mods:role>
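
For reference, the serialized output above came from spark-xml's DataFrame write support, invoked along these lines (the output path and rootTag here are illustrative):

# write the parsed DataFrame back out as XML (illustrative path and rootTag)
df.write.format('com.databricks.spark.xml') \
    .options(rowTag='mods:mods', rootTag='mods:modsCollection') \
    .save('file:///tmp/combine/serialized')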

However, even if I could get the VALUE to be serialized as an actual element value, I still wouldn't be achieving my end goal of having these XML documents, discovered and read via Spark-XML's excellent globbing and XPath selection, as just strings.

Any insight would be appreciated.


Solution

  • Found a solution in a Databricks Spark-XML issue:

    # use spark-xml's Hadoop InputFormat directly: input is split on the
    # configured start/end tags, yielding (byte offset, raw XML string) pairs
    xml_rdd = sc.newAPIHadoopFile(
        'file:///tmp/mods/*.xml',
        'com.databricks.spark.xml.XmlInputFormat',
        'org.apache.hadoop.io.LongWritable',
        'org.apache.hadoop.io.Text',
        conf={
            'xmlinput.start': '<mods:mods>',
            'xmlinput.end': '</mods:mods>',
            'xmlinput.encoding': 'utf-8'
        }
    )
    

    I expected 250 records and got 250 records. The result is a simple RDD with each entire XML record as a string:

    In [8]: xml_rdd.first()
    Out[8]: 
    (4994,
     '<mods:mods xmlns:mets="http://www.loc.gov/METS/" xmlns:xl="http://www.w3.org/1999/xlink" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" version="3.0">\n\n\n               <mods:titleInfo>\n\n\n                  <mods:title>Jessie</mods:title>\n\n\n...
    ...
    ...
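
    Since the end goal was plain strings, the byte-offset keys can be dropped and the documents persisted directly, e.g. (a sketch; the output path is illustrative):

    # keep only the raw XML string; the key is the record's byte offset
    xml_strings = xml_rdd.map(lambda kv: kv[1])

    # persist the raw documents (illustrative path; embedded newlines kept as-is)
    xml_strings.saveAsTextFile('file:///tmp/mods_strings')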
    

    Credit to the Spark-XML maintainer(s) for a wonderful library and their attentiveness to issues.