Tags: azure, apache-spark, jupyter-notebook

PySpark3 - Reading XML files


I'm trying to read an XML file in my PySpark3 Jupyter notebook (running in Azure).

I have this code:

df = spark.read.load("wasb:///data/test/Sample Data.xml")

However, I keep getting the error java.io.IOException: Could not read footer for file:

An error occurred while calling o616.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43, wn2-xxxx.cloudapp.net, executor 2): java.io.IOException: Could not read footer for file: FileStatus{path=wasb://xxxx.blob.core.windows.net/data/test/Sample Data.xml; isDirectory=false; length=6947; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

I know it's reaching the file, since the length in the error matches the XML file size, but I'm stuck after that.

Any ideas?

Thanks.


Solution

  • Please refer to the two blog posts below; I think they can answer your question completely.

    1. Azure Blob Storage with Pyspark
    2. Reading JSON, CSV and XML files efficiently in Apache Spark

    Spark's default data source is Parquet, which is why your spark.read.load call tries (and fails) to read a Parquet footer from the XML file; you need the spark-xml data source instead. The code is as below.

    from pyspark.sql import SparkSession

    session = SparkSession.builder.getOrCreate()

    # Authenticate against the storage account with its access key
    session.conf.set(
        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
        "<your-storage-account-access-key>"
    )
    # OR use a SAS token scoped to a container:
    # session.conf.set(
    #     "fs.azure.sas.<container-name>.blob.core.windows.net",
    #     "<sas-token>"
    # )

    # Your Sample Data.xml file sits in the virtual directory `data/test`;
    # rowTag must name the XML element that represents one row (here <book>)
    df = session.read.format("com.databricks.spark.xml") \
        .options(rowTag="book") \
        .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/data/test/")
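
    Once the load succeeds, a quick sanity check confirms that spark-xml parsed the rows you expect. This is a minimal sketch, assuming the df from the snippet above and that rowTag matches an element that actually occurs in your Sample Data.xml:

    df.printSchema()              # schema inferred from the elements nested under each row tag
    df.show(5, truncate=False)    # preview the first few parsed rows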
    

    If you are using Azure Databricks, I think the code will work as expected; otherwise, you may need to install the com.databricks.spark.xml library on your Apache Spark cluster.
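
    On an HDInsight PySpark3 Jupyter notebook, one common way to attach the library is the %%configure magic in the first cell, before the Spark session starts. This is only a minimal sketch; the Maven coordinates are an assumption, so match the Scala suffix and version to your cluster's Spark build.

    %%configure -f
    { "conf": { "spark.jars.packages": "com.databricks:spark-xml_2.11:0.5.0" } }

    Outside the notebook, the equivalent is passing the same coordinates to spark-submit via --packages.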

    Hope it helps.