I have a PySpark DataFrame with a column that contains bytes inside a nested dictionary, so the data looks like this: `{"bytes":"\u0014ok\u0000"}` and so on. The logical type of this field is DECIMAL, so it should yield a decimal value, but first I need to cast it to binary. When I cast it with the following code, the extracted value is incorrect. Can anyone help with this? Thanks.
df = df.withColumn("col_name", col("col_name").cast("binary"))
Here is my solution:
from pyspark.sql.functions import col, element_at, from_json, udf
from pyspark.sql.types import IntegerType

# The doubled backslashes keep \u0014 as a literal JSON escape in the string,
# so from_json (not Python) decodes it into the control character.
jsonString = """{"bytes":"\\u0014\\u0000"}"""
df = spark.createDataFrame(data=[(1, jsonString)], schema=["id", "value"])
df.show(truncate=False)

def convertColumn(s):
    # convert your bytes string to an integer here
    return 10

parseInt = udf(convertColumn, IntegerType())
res = df.withColumn(
    "parsed",
    parseInt(element_at(from_json(col("value"), "MAP<STRING,STRING>"), "bytes")),
)
res.show(truncate=False)
Output:
+---+------------------------+
|id |value                   |
+---+------------------------+
|1  |{"bytes":"\u0014\u0000"}|
+---+------------------------+

+---+------------------------+------+
|id |value                   |parsed|
+---+------------------------+------+
|1  |{"bytes":"\u0014\u0000"}|10    |
+---+------------------------+------+
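If you need the real decimal value rather than the placeholder `10` in `convertColumn`, the `"bytes"` string can be decoded in plain Python: each character of the extracted string is one byte (code points 0-255, so latin-1 round-trips it), and an Avro DECIMAL stores the unscaled value as a big-endian two's-complement integer. A minimal sketch of that decoding, assuming a scale of 2 (the actual scale comes from your Avro schema, not from the data):

```python
from decimal import Decimal

def decode_avro_decimal(s: str, scale: int) -> Decimal:
    # Each character in the JSON "bytes" string is one byte,
    # so encode with latin-1 to recover the raw byte sequence.
    raw = s.encode("latin-1")
    # Avro decimals store the unscaled value as a big-endian
    # two's-complement signed integer.
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # Shift the decimal point left by `scale` digits.
    return Decimal(unscaled).scaleb(-scale)

# "\u0014\u0000" -> bytes 0x14 0x00 -> unscaled 0x1400 = 5120
# with scale 2 that is 51.20
print(decode_avro_decimal("\u0014\u0000", 2))
```

A function like this can replace the `return 10` stub inside `convertColumn` (returning, say, a string or a properly typed `DecimalType` instead of `IntegerType`).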