I have a dataframe with a string datatype column with XML string. Now I want to create a new column with a nested XML structure from the original column. For this, I tried using XPath in PySpark.
Suppose I have this XML in a string format:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<description>
<tag1>Example text 1</tag1>
<tag2>Example text 2</tag2>
<tag4>Example text 3</tag4>
</description>
<year>2005</year>
<price>30.00</price>
</book>
</bookstore>
Now, I want to get the nested XML structure from the <description>
tag in a string in another column, i.e.,
<description>
<tag1>Example text 1</tag1>
<tag2>Example text 2</tag2>
<tag4>Example text 3</tag4>
</description>
as it is.
For example, the original column is original_content
with:
"<?xml version="1.0" encoding="UTF-8"?><bookstore><book category="cooking"><title lang="en">Everyday Italian</title><author>Giada De Laurentiis</author><description><tag1>Example text 1</tag1><tag2>Example text 2</tag2><tag4>Example text 3</tag4></description><year>2005</year><price>30.00</price></book></bookstore>"
Now I want to add a new column nested_content
to the dataframe with .withColumn
and the content should be:
"<description><tag1>Example text 1</tag1><tag2>Example text 2</tag2><tag4>Example text 3</tag4></description>"
For example, in XPath:
df = df.withColumn("new", xpath(col("content"), lit('/bookstore/book/description/*')))
Neither *
nor text()
is working, giving NULL
values.
SparkSQL's xpath function is to remove all XML tags and convert texts into a data structure like array, number, string etc. For your example:
df.selectExpr("""xpath(content, '/bookstore/book/description/*/text()') as new""").show(truncate=False)
+------------------------------------------------+
|new |
+------------------------------------------------+
|[Example text 1, Example text 2, Example text 3]|
+------------------------------------------------+
If you are looking for a snippet of your XML text(keeping XML tags), for your example, you can try regexp_extract function:
df.selectExpr("regexp_extract(content, '(?s:<description>.*?</description>)',0) AS new")
+------------------------------------------------------------------------------------------------------------+
|new |
+------------------------------------------------------------------------------------------------------------+
|<description><tag1>Example text 1</tag1><tag2>Example text 2</tag2><tag4>Example text 3</tag4></description>|
+------------------------------------------------------------------------------------------------------------+