I am trying to read a UTF-16 encoded file into a Spark DataFrame. However, when I display the DataFrame, I get unwanted special characters in the result.
I have tried the following - using UTF-16BE:
df = spark.read.format('text').option("encoding", 'UTF-16BE').option("charset", 'UTF-16').option('delimiter', "\|").option('header', 'false').option("quote", "\"").option("escape", "\"").load(filepath)
And have tried the following using UTF-16LE:
df = spark.read.format('text').option("encoding", 'UTF-16LE').option("charset", 'UTF-16').option('delimiter', "\|").option('header', 'false').option("quote", "\"").option("escape", "\"").load(filepath)
Both attempts return the unwanted special characters.
I would really appreciate any help.
I think you want to use csv instead of text, as the text format doesn't support the options you're trying to use. Try:
df = spark.read.format("csv")\
.option("encoding", "UTF-16")\
.option('sep', "|")\
.option("escape", "\"")\
.load(filepath)
Note that the default value of the header option is false, so there is no need to specify it. The same goes for quote, whose default is already the double quote. Also, charset is not among the documented options; you only need the encoding option.
You can find all available options for each data source here: DataFrameReader
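If the special characters persist, it may help to confirm what the file actually starts with. The sketch below is not part of Spark. It is a small standard-library helper (the names sniff_bom and sniff_file are made up for illustration) that reports which byte-order mark, if any, begins the file. It assumes you can read a local copy of the file. In Java, the plain UTF-16 charset uses a leading BOM to pick the byte order, while UTF-16BE and UTF-16LE decode it as an ordinary U+FEFF character, which is one possible source of a stray character at the start of the data.

```python
import codecs

# Known byte-order marks. UTF-32 comes first because its little-endian BOM
# (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(head):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def sniff_file(path, n=4):
    """Read the first n bytes of a local file and report its BOM, if any."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(n))
```

For example, sniff_file("/tmp/sample.txt") (a hypothetical local path) returning "UTF-16LE" would suggest the file is little-endian UTF-16 with a BOM, in which case option("encoding", "UTF-16") is the natural thing to try. A result of None means there is no BOM and the byte order has to be known some other way. This is only a heuristic: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE.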