excel, xml, azure-synapse, azure-data-lake

Analyze Excel and XML files using Azure Synapse


I have Excel and XML files stored in Azure Data Lake. I would like to analyze them using Azure Synapse, i.e. use serverless SQL to view the data, create views, etc.

I can analyze CSV files, but Synapse doesn't seem to support Excel or XML files.

Is there a way I can analyze the data using Azure Synapse?


Solution

  • Azure Synapse Analytics dedicated SQL pools and serverless SQL pools do not support the XML data type or any of the functions that accompany it, including FOR XML.

    In the supported-products listing referenced here (image not reproduced), Azure Synapse is not included.

    A simple approach would be to use Azure SQL Database for processing your XML data. For example, you can use a pipeline to load your data into Azure SQL Database, then call a stored procedure to handle the XML processing, and finally retrieve the results.
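
    As a rough sketch of that pattern (assuming the XML has already been landed in a staging table by the pipeline, that pyodbc is available where you run this, and that a stored procedure such as dbo.usp_ProcessXml exists; the server, database, and procedure names here are hypothetical), the final "retrieve the results" step could look like:

    import pyodbc

    # Hypothetical connection details - replace with your own server, database and authentication
    conn_str = (
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=tcp:<your-server>.database.windows.net,1433;"
        "Database=<your-database>;"
        "Uid=<user>;Pwd=<password>;"
        "Encrypt=yes;"
    )

    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        # dbo.usp_ProcessXml stands in for a stored procedure that shreds the
        # staged XML (e.g. with nodes()/value()) into relational rows
        cursor.execute("EXEC dbo.usp_ProcessXml")
        for row in cursor.fetchall():
            print(row)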

    Alternatively, you can use a Synapse notebook with Python or Scala to process the XML directly.

    I have tried the below approach:

    from pyspark.sql import SparkSession

    # In a Synapse notebook a SparkSession named `spark` is already available;
    # building one explicitly (with a valid master URL) is only needed when
    # running this code outside Synapse, e.g. locally.
    appName = "ExampleApp"
    master = "local[*]"
    spark = SparkSession.builder \
        .appName(appName) \
        .master(master) \
        .getOrCreate()
    spark.sparkContext.setLogLevel('WARN')
    Dilip_samplexml_data = [
        {'id': 1, 'data': """<test a="100" b="200">
            <records>
                <record id="101" />
                <record id="201" />
            </records>
        </test>
    """},
        {'id': 2, 'data': """<test a="200" b="400">
            <records>
                <record id="202" />
                <record id="402" />
            </records>
        </test>
    """}]
    df = spark.createDataFrame(Dilip_samplexml_data)
    print(df.schema)
    df.show()
    

    Results:

    StructType([StructField('data', StringType(), True), StructField('id', LongType(), True)])
    +--------------------+---+
    |                data| id|
    +--------------------+---+
    |<test a="100" b="...|  1|
    |<test a="200" b="...|  2|
    +--------------------+---+
    

    You can use a UDF to extract the attribute values:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType
    import xml.etree.ElementTree as ET
    # Without an explicit returnType, @udf defaults to StringType, so the
    # returned list is stored as its string representation.
    @udf
    def extract_ab(xml):
        doc = ET.fromstring(xml)
        return [doc.attrib['a'], doc.attrib['b']]
    df = df.withColumn('ab', extract_ab(df['data']))
    df.show()
    

    Results:

    +--------------------+---+----------+
    |                data| id|        ab|
    +--------------------+---+----------+
    |<test a="100" b="...|  1|[100, 200]|
    |<test a="200" b="...|  2|[200, 400]|
    +--------------------+---+----------+
    

    In the above code I am using xml.etree.ElementTree to extract the values, and the udf decorator to mark the Python function as a UDF. If you want to extract the record ids as an ARRAY, you can use the below UDF:

    from pyspark.sql.types import IntegerType
    def extract_rid(xml):
        doc = ET.fromstring(xml)
        records = doc.findall('records/record')
        ids = []
        for r in records:
            ids.append(int(r.attrib["id"]))
        return ids
    # Declare the return type explicitly so 'rids' is a real array<int> column
    schema = ArrayType(IntegerType())
    udf_extract_rid = udf(extract_rid, schema)
    df = df.withColumn('rids', udf_extract_rid(df["data"]))
    print(df.schema)
    df.show()
    

    Results:
    
    StructType([StructField('data', StringType(), True), StructField('id', LongType(), True), StructField('ab', StringType(), True), StructField('rids', ArrayType(IntegerType(), True), True)])
    +--------------------+---+----------+----------+
    |                data| id|        ab|      rids|
    +--------------------+---+----------+----------+
    |<test a="100" b="...|  1|[100, 200]|[101, 201]|
    |<test a="200" b="...|  2|[200, 400]|[202, 402]|
    +--------------------+---+----------+----------+
    

    We can use the built-in explode function to expand the array column into individual rows:

    from pyspark.sql.functions import explode
    df.withColumn('rid', explode(df['rids'])).show()
    

    Results:

    +--------------------+---+----------+----------+---+
    |                data| id|        ab|      rids|rid|
    +--------------------+---+----------+----------+---+
    |<test a="100" b="...|  1|[100, 200]|[101, 201]|101|
    |<test a="100" b="...|  1|[100, 200]|[101, 201]|201|
    |<test a="200" b="...|  2|[200, 400]|[202, 402]|202|
    |<test a="200" b="...|  2|[200, 400]|[202, 402]|402|
    +--------------------+---+----------+----------+---+
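
    If you then want to analyze the parsed result with serverless SQL, one option (a minimal sketch; the output path below is just an example) is to write the flattened DataFrame back to the data lake as Parquet and query that folder with OPENROWSET or an external table:

    # Example output path - replace with your own container and storage account
    output_path = "abfss://<containername>@<storageaccountname>.dfs.core.windows.net/parsed-xml/"

    df.withColumn('rid', explode(df['rids'])) \
        .select('id', 'rid') \
        .write.mode('overwrite').parquet(output_path)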
    

    If you want to read an Excel file from ADLS Gen2 using Spark, you need to include the com.crealytics.spark.excel package from the Maven repository.

    Next, upload the downloaded JAR file to Synapse Studio as a workspace package:

    Go to Manage > Workspace packages > Upload, then select and upload the downloaded JAR file.

    To install the package on the Apache Spark pool you're using for the notebook:

    • Select the Spark pool from the workspace.
    • Go to Packages.
    • Choose Select from workspace packages.
    • Select the Excel package and click Save.

    To read an Excel file from a notebook, use the following code:

    # Replace the placeholders with your own container, storage account and Excel file name
    filePath = "abfss://<containername>@<storageaccountname>.dfs.core.windows.net/<sample-excel-file>"
    sparkDF = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(filePath)
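
    A quick follow-up check, and an optional temporary view if you also want to query the loaded Excel data with Spark SQL in the same session (the view name excel_data is just an example), might look like this:

    # Inspect the inferred schema and a few rows
    sparkDF.printSchema()
    sparkDF.show(10)

    # Optionally expose the DataFrame to Spark SQL within this Spark session
    sparkDF.createOrReplaceTempView("excel_data")
    spark.sql("SELECT COUNT(*) AS row_count FROM excel_data").show()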