I am trying to read a dataframe from a parquet file with Spark in python but my dataframe is byte encoded so when I use spark.read.parquet and then df.show() it looks like the following:
+---+----------+----+
| C1| C2| C3|
+---+----------+----+
| 1|[20 2D 2D]| 0|
| 2|[32 30 31]| 0|
| 3|[43 6F 6D]| 0|
+---+----------+----+
As you can see it the values are converted to hexadecimal values... I've read the entire documentation of spark dataframes but I did not found anything. Is it possible to convert to UTF-8?
The df.printSchema() output:
|-- C1: long (nullable = true)
|-- C2: binary (nullable = true)
|-- C3: long (nullable = true)
The Spark version is 2.4.4
Thank you!
You have a binary
type column, which is like a bytearray
in python. You just need to cast to string:
df = df.withColumn("C2", df["C2"].cast("string"))
df.show()
#+---+---+---+
#| C1| C2| C3|
#+---+---+---+
#| 1| --| 0|
#| 2|201| 0|
#| 3|Com| 0|
#+---+---+---+
Likewise in python:
str(bytearray([0x20, 0x2D, 0x2D]))
#' --'