Tags: python, pyspark

Bytes values in pySpark Dataframe


I have a PySpark dataframe with a column containing bytes in a nested dictionary, so the data looks like this: Col_name: `{"bytes":"\u0014ok\u0000"}` and so on. The logical type of this field is DECIMAL, so it should return a decimal value, but first I need to cast it to binary. When I cast it with the following code, the extracted value is incorrect. Can anyone help with this? Thanks.

df = df.withColumn("col_name", col("col_name").cast("binary"))

Solution

  • Here is my solution:

    from pyspark.sql.functions import udf, col, element_at, from_json
    from pyspark.sql.types import IntegerType
    
    jsonString = """{"bytes":"\\u0014\\u0000"}"""
    df = spark.createDataFrame(data=[(1, jsonString)], schema=["id", "value"])
    
    df.show(truncate=False)
    
    def convertColumn(s):
        # convert the bytes string to an integer here
        return 10
    
    parseInt = udf(convertColumn, IntegerType())
    
    # Parse the JSON string into a map and extract the "bytes" entry,
    # then hand it to the UDF for conversion.
    res = df.withColumn(
        "parsed",
        parseInt(element_at(from_json(col("value"), "MAP<STRING,STRING>"), "bytes")),
    )
    
    res.show(truncate=False)
    

    Output:

    +---+------------------------+
    |id |value                   |
    +---+------------------------+
    |1  |{"bytes":"\u0014\u0000"}|
    +---+------------------------+
    
    +---+------------------------+------+
    |id |value                   |parsed|
    +---+------------------------+------+
    |1  |{"bytes":"\u0014\u0000"}|10    |
    +---+------------------------+------+
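    The `convertColumn` stub above just returns a constant. A sketch of a real conversion, assuming the field is an Avro-style DECIMAL (big-endian two's-complement unscaled integer, here with an illustrative scale of 2 — the actual scale comes from your schema), could look like this in plain Python:

    ```python
    from decimal import Decimal

    def decode_decimal(s, scale=2):
        # from_json returns the bytes as a string; latin-1 maps code
        # points 0-255 one-to-one back to the raw byte values.
        raw = s.encode("latin-1")
        # Avro decimals store the unscaled value as a big-endian
        # two's-complement integer.
        unscaled = int.from_bytes(raw, byteorder="big", signed=True)
        # Apply the (assumed) scale from the schema.
        return Decimal(unscaled).scaleb(-scale)

    print(decode_decimal("\u0014\u0000"))  # 0x1400 = 5120 unscaled → 51.20
    ```

    You could drop a function like this into the `udf` in place of `convertColumn` (returning a `DecimalType` instead of `IntegerType`); the scale parameter is an assumption and should match your field's logical type.
    
    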