Search code examples
pythonamazon-web-servicesapache-sparkpysparkaws-glue

How can I convert a Binary that is contained in a Spark column as a StringType to a UUID string using Python?


We are building a piece with AWS Glue that will Unload a table in Aurora RDS to a CSV File that must later be converted to Parquet.

To unload, we are using SELECT * FROM TableA INTO OUTFILE S3 's3_path' FORMAT CSV HEADER According to the documentation at: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.SaveIntoS3.html

Before turning this CSV into Parquet, all columns that start with "cod_idef_" are always Binary and must be converted to UUID. However, when reading the CSV file with Spark, it infers the columns as StringType.

How would it be possible to convert these columns to a UUID string?

I've tried:

def binary_to_uuid(binary_val):
    if binary_val:
        return str(uuid.UUID(bytes=bytearray(binary_val)))
    else:
        return None
columns_to_convert = [col_name for col_name in df.columns if col_name.startswith("cod_idef")]

# Apply the UDF to each column
for col_name in columns_to_convert:
    df = df.withColumn(col_name, udf(binary_to_uuid, StringType())(col(col_name)))

Edit 1: Here is an example, as requested by user238607 The binary: enter image description here

The result should be: 00000000-8a3a-46b2-84c2-a227836ea168


Solution

  • CSV (Comma-Separated Values) is a text-based format, and it does not support the direct representation of binary data. Therefore, it is expected that your file has a StringType for that column. You need to check how the binary column looks in the exported file on S3. If it's a hex representation of binary data (e.g. 000000008a3a46b284c2a227836ea168), then you just need to convert it from hex to a UUID:

    # Define a UDF to parse the UUIDs from the hexadecimal strings
    @udf(returnType=StringType())
    def parse_uuid(hex_string):
        return str(uuid.UUID(hex_string))
    
    # Load the CSV file into a DataFrame
    df = spark.read.csv(csv_file, header=True, inferSchema=True)
    
    # Apply the UDF to the 'uuid_binary' column to parse the UUIDs
    df = df.withColumn("uuid", parse_uuid(df["uuid_binary"]))