Search code examples
xmldataframepysparkxpath

How to get nested xml structure as a string from an xml document using xpath in pyspark dataframe?


I have a dataframe with a string datatype column with XML string. Now I want to create a new column with a nested XML structure from the original column. For this, I tried using XPath in PySpark.

Suppose I have this XML in a string format:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <description>
      <tag1>Example text 1</tag1>
      <tag2>Example text 2</tag2>
      <tag4>Example text 3</tag4>
    </description>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>

Now, I want to get the nested XML structure from the <description> tag in a string in another column, i.e.,

<description>
  <tag1>Example text 1</tag1>
  <tag2>Example text 2</tag2>
  <tag4>Example text 3</tag4>
</description>

as it is.

For example, the original column is original_content with:

"<?xml version="1.0" encoding="UTF-8"?><bookstore><book category="cooking"><title lang="en">Everyday Italian</title><author>Giada De Laurentiis</author><description><tag1>Example text 1</tag1><tag2>Example text 2</tag2><tag4>Example text 3</tag4></description><year>2005</year><price>30.00</price></book></bookstore>"

Now I want to add a new column nested_content to the dataframe with .withColumn and the content should be:

"<description><tag1>Example text 1</tag1><tag2>Example text 2</tag2><tag4>Example text 3</tag4></description>"

For example, in XPath:

df = df.withColumn("new", xpath(col("content"), lit('/bookstore/book/description/*')))

Neither * nor text() is working, giving NULL values.


Solution

  • SparkSQL's xpath function is to remove all XML tags and convert texts into a data structure like array, number, string etc. For your example:

    df.selectExpr("""xpath(content, '/bookstore/book/description/*/text()') as new""").show(truncate=False)
    +------------------------------------------------+
    |new                                             |
    +------------------------------------------------+
    |[Example text 1, Example text 2, Example text 3]|
    +------------------------------------------------+
    

    If you are looking for a snippet of your XML text(keeping XML tags), for your example, you can try regexp_extract function:

    df.selectExpr("regexp_extract(content, '(?s:<description>.*?</description>)',0) AS new")
    +------------------------------------------------------------------------------------------------------------+
    |new                                                                                                         |
    +------------------------------------------------------------------------------------------------------------+
    |<description><tag1>Example text 1</tag1><tag2>Example text 2</tag2><tag4>Example text 3</tag4></description>|
    +------------------------------------------------------------------------------------------------------------+