Tags: python-3.x, dataframe, apache-spark, pyspark, parquet

Parquet bytes dataframe to UTF-8 in Spark


I am trying to read a DataFrame from a Parquet file with Spark in Python, but one of my columns is byte-encoded, so when I use spark.read.parquet and then df.show(), the output looks like the following:

    +---+----------+----+
    | C1|        C2|  C3|
    +---+----------+----+
    |  1|[20 2D 2D]|   0|
    |  2|[32 30 31]|   0|
    |  3|[43 6F 6D]|   0|
    +---+----------+----+

As you can see, the values are displayed as hexadecimal bytes... I've read through the Spark DataFrame documentation but did not find anything about this. Is it possible to convert them to UTF-8?

The df.printSchema() output:

 |-- C1: long (nullable = true)
 |-- C2: binary (nullable = true)
 |-- C3: long (nullable = true)

The Spark version is 2.4.4.
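
For reference, here is a minimal way to reproduce the behaviour; the /tmp/demo.parquet path and the sample values (decoded from the hex bytes above) are just placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, BinaryType

    spark = SparkSession.builder.getOrCreate()

    # A tiny DataFrame matching the schema above, with a binary column
    schema = StructType([
        StructField("C1", LongType()),
        StructField("C2", BinaryType()),
        StructField("C3", LongType()),
    ])
    data = [(1, bytearray(b" --"), 0),
            (2, bytearray(b"201"), 0),
            (3, bytearray(b"Com"), 0)]
    spark.createDataFrame(data, schema).write.mode("overwrite").parquet("/tmp/demo.parquet")

    # Reading it back, show() renders the binary column as hex bytes
    df = spark.read.parquet("/tmp/demo.parquet")
    df.show()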

Thank you!


Solution

  • You have a binary-type column, which is like a bytearray in Python. You just need to cast it to string:

    # casting binary to string decodes the bytes as UTF-8
    df = df.withColumn("C2", df["C2"].cast("string"))
    df.show()
    #+---+---+---+
    #| C1| C2| C3|
    #+---+---+---+
    #|  1| --|  0|
    #|  2|201|  0|
    #|  3|Com|  0|
    #+---+---+---+
    

    Likewise in Python (in Python 3, decode the bytes rather than calling str on them):

    bytearray([0x20, 0x2D, 0x2D]).decode("utf-8")
    #' --'
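
    If you ever need a character set other than the default, a sketch of an alternative using pyspark.sql.functions.decode, which takes an explicit charset:

    from pyspark.sql import functions as F

    # decode interprets the binary column using the given character set
    df = df.withColumn("C2", F.decode(df["C2"], "UTF-8"))
    # produces the same output as the cast above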