Tags: python-3.x, dataframe, apache-spark, pyspark, parquet

Parquet bytes dataframe to UTF-8 in Spark


I am trying to read a DataFrame from a Parquet file with Spark in Python, but one of my columns is byte-encoded, so when I use spark.read.parquet and then df.show(), the output looks like the following:

    +---+----------+----+
    | C1|        C2|  C3|
    +---+----------+----+
    |  1|[20 2D 2D]|   0|
    |  2|[32 30 31]|   0|
    |  3|[43 6F 6D]|   0|
    +---+----------+----+

As you can see, the values are displayed as hexadecimal bytes... I've read through the Spark DataFrame documentation but did not find anything about this. Is it possible to convert them to UTF-8?

The df.printSchema() output:

 |-- C1: long (nullable = true)
 |-- C2: binary (nullable = true)
 |-- C3: long (nullable = true)

The Spark version is 2.4.4.
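
For reference, here is a minimal way to reproduce the behaviour; the /tmp/demo.parquet path and the sample values (decoded from the hex bytes above) are just placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, BinaryType

    spark = SparkSession.builder.getOrCreate()

    # A tiny DataFrame matching the schema above, with a binary column
    schema = StructType([
        StructField("C1", LongType()),
        StructField("C2", BinaryType()),
        StructField("C3", LongType()),
    ])
    data = [(1, bytearray(b" --"), 0),
            (2, bytearray(b"201"), 0),
            (3, bytearray(b"Com"), 0)]
    spark.createDataFrame(data, schema).write.mode("overwrite").parquet("/tmp/demo.parquet")

    # Reading it back, show() renders the binary column as hex bytes
    df = spark.read.parquet("/tmp/demo.parquet")
    df.show()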

Thank you!


Solution

  • You have a binary-type column, which is like a bytearray in Python. You just need to cast it to string:

    # casting binary to string decodes the bytes as UTF-8
    df = df.withColumn("C2", df["C2"].cast("string"))
    df.show()
    #+---+---+---+
    #| C1| C2| C3|
    #+---+---+---+
    #|  1| --|  0|
    #|  2|201|  0|
    #|  3|Com|  0|
    #+---+---+---+
    

    Likewise in Python (in Python 3, decode the bytes rather than calling str on them):

    bytearray([0x20, 0x2D, 0x2D]).decode("utf-8")
    #' --'
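
    If you ever need a character set other than the default, a sketch of an alternative using pyspark.sql.functions.decode, which takes an explicit charset:

    from pyspark.sql import functions as F

    # decode interprets the binary column using the given character set
    df = df.withColumn("C2", F.decode(df["C2"], "UTF-8"))
    # produces the same output as the cast above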