
Databricks problem accessing file _metadata


I'm trying to access the _metadata column to get each file's modification time, following these instructions: https://docs.databricks.com/en/ingestion/file-metadata-column.html

Here is my code:

df = spark.read \
    .format('com.databricks.spark.xml') \
    .options(rowTag='TAG2') \
    .options(nullValue='') \
    .load(xmlFile) \
    .select("*", "_metadata")

This works when I load a CSV file, but not when I load an XML file: I get an error stating that there is no such column.

I am sure that the code that loads the XML contents itself works correctly.

Is this feature simply not supported for XML files, or am I doing something wrong?


Solution

  • I ended up using a slightly different approach, since I decided to ingest the files with Auto Loader. Following this example, I read the files as binary and then converted them to XML: https://docs.databricks.com/en/_extras/notebooks/source/kb/streaming/streaming-xml-example.html

    With that approach I could access the metadata without any issues.
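
A minimal sketch of the approach described above, not the code from the linked notebook. It assumes the files are read through Auto Loader (`cloudFiles`) with the `binaryFile` format and that the raw bytes are parsed afterwards. The helper names `parse_rows` and `read_binary_with_metadata`, and the parameter `source_path`, are hypothetical; the XML parsing uses Python's standard library rather than spark-xml, purely for illustration.

```python
import xml.etree.ElementTree as ET


def parse_rows(content: bytes, row_tag: str = "TAG2") -> list:
    """Return one dict per <row_tag> element, mapping child tag -> text.

    Empty child elements yield None. Hypothetical helper for illustration.
    """
    root = ET.fromstring(content)
    return [
        {child.tag: child.text for child in elem}
        for elem in root.iter(row_tag)
    ]


def read_binary_with_metadata(spark, source_path: str):
    """Hypothetical helper: stream files as binary and keep file metadata.

    Assumes a Databricks runtime where Auto Loader is available.
    """
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load(source_path)
        # Raw file bytes plus the file-level metadata fields of interest.
        .select(
            "content",
            "_metadata.file_path",
            "_metadata.file_modification_time",
        )
    )
```

On a cluster, `parse_rows` could be wrapped in a UDF and applied to the `content` column, so that each parsed row keeps the modification time of the file it came from. The binary file source also exposes its own `path` and `modificationTime` columns, which may be sufficient if only the modification time is needed.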