scala apache-spark binary hex

spark binary (byte array) to get bytes as string


I have the following DataFrame:

`scala> res1.printSchema
root
 |-- REC: binary (nullable = true)


scala> res1.show(1,false)
+----------------------+
|REC                   |
+----------------------+
|[75 00 01 00 4C 12 10]|
+----------------------+`

I need to get the string "75 00 01 00 4C 12 10" from the binary column. Please help.

I tried mkString(" "), but it seems to convert the bytes to standard ASCII. I want the literal byte values as a string: "75 00 01 00 4C 12 10".


Solution

  • Possibly you just want hex, but if you really want the spaces then:

    val df = sparkSession.sql("select cast('i am text' as binary) bytes")
    df.show
    val castedToString = df.selectExpr("cast(bytes as string) casted")
    castedToString.show
    val hexed = df.selectExpr("hex(bytes) hexString")
    hexed.show
    val prettyString = hexed.selectExpr("rtrim(regexp_replace(hexString,'(.{2})', '$1 ')) perToPrettyString")
    prettyString.show
    

    yields:

    +--------------------+
    |               bytes|
    +--------------------+
    |[69 20 61 6D 20 7...|
    +--------------------+
    
    +---------+
    |   casted|
    +---------+
    |i am text|
    +---------+
    
    +------------------+
    |         hexString|
    +------------------+
    |6920616D2074657874|
    +------------------+
    
    +--------------------+
    |   perToPrettyString|
    +--------------------+
    |69 20 61 6D 20 74...|
    +--------------------+
    

    The last expression replaces every pair of characters with the same pair followed by a space; rtrim then removes the trailing space.
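
    The same substitution can be tried outside Spark. This is a minimal sketch in plain Scala, assuming the input is an even-length hex string; the object name SpacedHexRegex and the method addSpaces are made up for illustration, and stripSuffix stands in for SQL's rtrim:

    ```scala
    object SpacedHexRegex {
      // Same pattern as the SQL expression above: capture two characters
      // and emit them followed by a space.
      def addSpaces(hexString: String): String =
        hexString.replaceAll("(.{2})", "$1 ").stripSuffix(" ")

      def main(args: Array[String]): Unit = {
        // Hex of "i am text", as produced by hex(bytes) above.
        println(addSpaces("6920616D2074657874")) // 69 20 61 6D 20 74 65 78 74
      }
    }
    ```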

    Details:

    When Spark performs ".show", each column is passed through a new internal expression, ToPrettyString. By the default it inherits from ToStringBase, it renders binary as hex (via SparkStringUtils.getHexString) and wraps the result in square brackets:

    bytes.map("%02X".format(_)).mkString("[", " ", "]")
    

    Cast overrides this default, which is why casting to string decodes the bytes as text instead of producing hex:

      override protected def useHexFormatForBinary: Boolean = false
    

    The hex SQL function calls Hex.hex:

      def hex(bytes: Array[Byte]): UTF8String = {
        val length = bytes.length
        val value = new Array[Byte](length * 2)
        var i = 0
        while (i < length) {
          value(i * 2) = Hex.hexDigits((bytes(i) & 0xF0) >> 4)
          value(i * 2 + 1) = Hex.hexDigits(bytes(i) & 0x0F)
          i += 1
        }
        UTF8String.fromBytes(value)
      }
    

    You could also call this function from a UDF, and probably should instead of using a regex if performance matters.
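
    As a rough sketch of what such a UDF body might look like, the helper below follows the same nibble-lookup idea as Hex.hex but writes the separating spaces in the same pass, so no regex is needed. The names SpacedHex and toSpacedHex are hypothetical, it does not call Spark's internal Hex object, and it has not been benchmarked against the regex version:

    ```scala
    object SpacedHex {
      // Lookup table for the 16 hex digits (upper case, like the hex SQL function).
      private val hexDigits: Array[Char] = "0123456789ABCDEF".toCharArray

      // Formats bytes as two-digit hex pairs separated by single spaces.
      // Returns null for null input so a nullable column stays null.
      def toSpacedHex(bytes: Array[Byte]): String = {
        if (bytes == null) null
        else {
          val sb = new java.lang.StringBuilder(math.max(0, bytes.length * 3 - 1))
          var i = 0
          while (i < bytes.length) {
            if (i > 0) sb.append(' ')
            sb.append(hexDigits((bytes(i) & 0xF0) >> 4)) // high nibble
            sb.append(hexDigits(bytes(i) & 0x0F))        // low nibble
            i += 1
          }
          sb.toString
        }
      }

      def main(args: Array[String]): Unit = {
        // The bytes from the question.
        val rec = Array[Byte](0x75, 0x00, 0x01, 0x00, 0x4C, 0x12, 0x10)
        println(toSpacedHex(rec)) // 75 00 01 00 4C 12 10
      }
    }
    ```

    In Spark, a function like this could be wrapped with udf from org.apache.spark.sql.functions and applied to the REC column; that wiring is left out here so the sketch runs without a Spark dependency.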