I'm trying to read an XML file with pySpark, but it produces an empty DataFrame. I was able to read other files without any issues.
Here is the XML structure:
<?xml version="1.0"?>
<TAG1>
    <TAG2 id="ID1">
        <field1>some value</field1>
        <field2>some value</field2>
        <field3></field3>
    </TAG2>
    <TAG2 id="ID2">
        <field1>some value</field1>
        <field2></field2>
        <field3></field3>
    </TAG2>
    <TAG2 id="ID3">
        <field1>some value</field1>
        <field2>some value</field2>
        <field3>some value</field3>
    </TAG2>
</TAG1>
I tried different options when reading the file, but none of them worked:
df = spark.read \
    .format('com.databricks.spark.xml') \
    .options(rowTag='TAG2', nullValue='', emptyValue='') \
    .load(xmlFile)
UPDATE:
I copied the file contents to a new file, saved it, and that copy loaded without any problems, so the original file itself seems to be the cause. I tried various encodings in the read options but still could not load the original file.
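One way to diagnose this kind of problem locally is to sniff the file's byte-order mark before handing it to Spark. This is just a plain-Python sketch, not part of spark-xml; the function name and path are illustrative:

```python
def sniff_bom(path):
    """Guess a file's encoding from its byte-order mark, if any."""
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(b'\xef\xbb\xbf'):
        return 'UTF-8-SIG'
    if head.startswith(b'\xff\xfe'):
        return 'UTF-16LE'
    if head.startswith(b'\xfe\xff'):
        return 'UTF-16BE'
    return None  # no BOM: could still be plain UTF-8/ASCII
```

If this returns a UTF-16 variant for a file you expected to be UTF-8, that mismatch is a likely reason the reader silently produces an empty DataFrame.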
The file turned out to be encoded in UTF-16, but setting that encoding alone didn't solve the issue. I managed to load the file only after passing an additional parameter, charset:
df = spark.read \
    .format('com.databricks.spark.xml') \
    .options(rowTag='TAG2', nullValue='', encoding='UTF-16LE', charset='UTF-16') \
    .load(xmlFile)
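Another workaround, assuming the input is a single local file small enough to hold in memory: transcode it to UTF-8 once, then point spark-xml at the converted copy with default options. A minimal sketch (paths are placeholders):

```python
def transcode_to_utf8(src_path, dst_path, source_encoding='utf-16'):
    # Python's 'utf-16' codec consumes the BOM and picks the right
    # endianness automatically, so the output is clean BOM-less UTF-8.
    with open(src_path, 'r', encoding=source_encoding) as fin:
        text = fin.read()
    with open(dst_path, 'w', encoding='utf-8') as fout:
        fout.write(text)
```

This sidesteps the encoding/charset options entirely, at the cost of an extra pass over the file.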