Tags: xml, apache-spark, pyspark, databricks, azure-databricks

pyspark XML read producing empty dataframe


I'm trying to read an XML file with pySpark. The problem is that it produces an empty dataframe. I was able to read other files without any issues.

Here is the XML structure:

<?xml version="1.0"?>
<TAG1>
  <TAG2 id="ID1"> 
      <field1>some value</field1>
      <field2>some value</field2>
      <field3></field3>
  </TAG2>
  <TAG2 id="ID2"> 
      <field1>some value</field1>
      <field2></field2>
      <field3></field3>
  </TAG2>
  <TAG2 id="ID3"> 
      <field1>some value</field1>
      <field2>some value</field2>
      <field3>some value</field3>
  </TAG2>
</TAG1>

I tried different options when reading the file, but none of them worked:

df = spark.read \
    .format('com.databricks.spark.xml') \
    .options(rowTag='TAG2') \
    .options(nullValue='') \
    .options(emptyValue='') \
    .load(xmlFile)

UPDATE:

I copied the file contents into a new file, saved it, and that copy loaded without any problems. It seems the original file itself is causing the issue. I tried passing various encodings in the read options but still could not load the original file.


Solution

  • The file happened to be encoded in UTF-16, but setting that encoding alone didn't solve the issue. I managed to load the file only after passing an additional parameter, charset:

    df = spark.read \
        .format('com.databricks.spark.xml') \
        .options(rowTag='TAG2') \
        .options(nullValue='') \
        .options(encoding='UTF-16LE') \
        .options(charset='UTF-16') \
        .load(xmlFile)
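
To diagnose a case like this before guessing at read options, you can inspect the file's first bytes for a byte-order mark (BOM), which reveals UTF-16/UTF-8 encodings. This is a minimal sketch in plain Python (not part of the original answer); the sample file it writes is only there to make the snippet self-contained — point `detect_bom` at your own xmlFile instead.

```python
import codecs
import os
import tempfile

def detect_bom(path):
    """Return a best-guess encoding based on the file's leading BOM, if any."""
    with open(path, 'rb') as f:
        head = f.read(4)
    # Check the longer UTF-32 BOMs first, since the UTF-32LE BOM
    # starts with the same two bytes as the UTF-16LE BOM.
    if head.startswith(codecs.BOM_UTF32_LE):
        return 'UTF-32LE'
    if head.startswith(codecs.BOM_UTF32_BE):
        return 'UTF-32BE'
    if head.startswith(codecs.BOM_UTF16_LE):
        return 'UTF-16LE'
    if head.startswith(codecs.BOM_UTF16_BE):
        return 'UTF-16BE'
    if head.startswith(codecs.BOM_UTF8):
        return 'UTF-8'
    return 'unknown (no BOM)'

# Demo: write a small UTF-16LE XML file with an explicit BOM and check it.
tmp = tempfile.NamedTemporaryFile(suffix='.xml', delete=False)
tmp.write(codecs.BOM_UTF16_LE + '<?xml version="1.0"?><TAG1/>'.encode('utf-16-le'))
tmp.close()
print(detect_bom(tmp.name))  # UTF-16LE
os.remove(tmp.name)
```

Knowing the actual encoding up front lets you set the charset option deliberately rather than by trial and error.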