I am trying to parse XML and extract node value within XML using Spark-SQL ,
Consider below XML
<Account><ID1><other><ID>123</ID></other></ID1></Account>
Here I am trying to extract 123 using XPath but not getting expected Result.
I appreciated any help here.
to parse XML and extract node values in azure databricks using spark sql, you can use xpath_string
function. example:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder \
.appName("XML Parsing with Spark SQL") \
.getOrCreate()
# Sample data with XML content
data = [
('<Account><ID1><other><ID>123</ID></other></ID1></Account>',),
('<Account><ID1><other><ID>456</ID></other></ID1></Account>',)
]
# Define column names
columns = ['xml_data']
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the DataFrame content
df.show(truncate=False)
# Register the DataFrame as a temp view
df.createOrReplaceTempView("xml_table")
# Execute SQL query to extract values using XPath
result = spark.sql("""
SELECT
xpath_string(xml_data, 'Account/ID1/other/ID/text()') AS extracted_id
FROM
xml_table
""")
# Show the results
result.show(truncate=False)
# Stop the Spark session
spark.stop()
output:
extracted_id |
---|
123 |
456 |