Tags: python, pyspark, apache-spark-sql, azure-databricks, utf-16

Unable to read UTF-16 file


I am trying to read a UTF-16 encoded file into a Spark dataframe. However, when I display the dataframe, the result set contains unwanted special characters.

I have tried the following, using UTF-16BE:

df = spark.read.format('text') \
    .option("encoding", 'UTF-16BE') \
    .option("charset", 'UTF-16') \
    .option('delimiter', "\|") \
    .option('header', 'false') \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .load(filepath)

And I have tried the following, using UTF-16LE:

df = spark.read.format('text') \
    .option("encoding", 'UTF-16LE') \
    .option("charset", 'UTF-16') \
    .option('delimiter', "\|") \
    .option('header', 'false') \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .load(filepath)

Both attempts return the unwanted special characters.

I would really appreciate any help.


Solution

  • I think you want to use csv instead of text, as the text format doesn't support the options you're trying to use. Try:

    df = spark.read.format("csv")\
        .option("encoding", "UTF-16")\
        .option('sep', "|")\
        .option("escape", "\"")\
        .load(filepath)
    

    Note that the default value of the header option is false, so there is no need to specify it; the same goes for quote, whose default is already the double quote. Also, there is no charset option; you only need encoding.
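
    If you're not sure whether the file is little- or big-endian, you can peek at its byte-order mark before picking an encoding. This is a minimal sketch, assuming the file is readable from the driver (on Databricks you would use a local-style path such as /dbfs/...):

    # Inspect the first two bytes for a UTF-16 byte-order mark (BOM)
    with open(filepath, "rb") as f:
        bom = f.read(2)

    if bom == b"\xff\xfe":
        print("UTF-16LE BOM detected")
    elif bom == b"\xfe\xff":
        print("UTF-16BE BOM detected")
    else:
        print("No UTF-16 BOM found; first bytes:", bom)

    When a BOM is present, the generic UTF-16 encoding lets the decoder pick the byte order on its own, which is why it's a safer choice than hard-coding UTF-16LE or UTF-16BE.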

    You can find all available options for each data source in the DataFrameReader documentation.
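
    For reference, here is a self-contained sketch of the same read. The sample path and file contents are made up for illustration, and it assumes the options behave the same on your Spark version:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Write a small pipe-delimited sample file; Python's utf-16 codec adds a BOM
    sample_path = "/tmp/sample_utf16.txt"
    with open(sample_path, "w", encoding="utf-16") as f:
        f.write('1|"hello, world"|café\n2|"a|b"|naïve\n')

    # Read it back with the options suggested above
    df = spark.read.format("csv") \
        .option("encoding", "UTF-16") \
        .option("sep", "|") \
        .option("escape", "\"") \
        .load(sample_path)

    df.show()  # inspect the parsed columns

    If the output still shows stray characters, adding .option("multiLine", "true") may help: the per-line reader splits on raw newline bytes, while multiLine forces the whole file to be decoded with the given charset first.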