I have the following DataFrame:
`scala> res1.printSchema
root
|-- REC: binary (nullable = true)
scala> res1.show(1,false)
+----------------------+
|REC |
+----------------------+
|[75 00 01 00 4C 12 10]|
+----------------------+`
My requirement is to get the string "75 00 01 00 4C 12 10" from the binary column. Please help.
I tried mkString(" "), but it seems to convert the values to standard ASCII, whereas I want the literal bytes as a string: "75 00 01 00 4C 12 10".
Possibly you just want hex, but if you really want the spaces then:
val df = sparkSession.sql("select cast('i am text' as binary) bytes")
df.show
val castedToString = df.selectExpr("cast(bytes as string) casted")
castedToString.show
val hexed = df.selectExpr("hex(bytes) hexString")
hexed.show
val prettyString = hexed.selectExpr("rtrim(regexp_replace(hexString,'(.{2})', '$1 ')) perToPrettyString")
prettyString.show
yields:
+--------------------+
| bytes|
+--------------------+
|[69 20 61 6D 20 7...|
+--------------------+
+---------+
| casted|
+---------+
|i am text|
+---------+
+------------------+
| hexString|
+------------------+
|6920616D2074657874|
+------------------+
+--------------------+
| perToPrettyString|
+--------------------+
|69 20 61 6D 20 74...|
+--------------------+
The last expression replaces every two characters by themselves and an additional space, then removes the last trailing space.
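To see what that regex does outside of Spark, here is a minimal plain-Scala sketch of the same replace-then-trim step. The object and method names (HexSpacing, addSpaces) are made up for illustration, and it uses Java's String.replaceAll rather than Spark's regexp_replace, so treat it as an analogy and not as what Spark runs internally.

```scala
// Hypothetical helper, for illustration only: mirrors the
// rtrim(regexp_replace(hexString, '(.{2})', '$1 ')) expression in plain Scala.
object HexSpacing {
  def addSpaces(hex: String): String =
    hex
      .replaceAll("(.{2})", "$1 ") // every pair of characters becomes "pair + space"
      .trim                        // drop the trailing space left after the last pair
}

object HexSpacingDemo extends App {
  // Input is the hexString value shown in the output above.
  println(HexSpacing.addSpaces("6920616D2074657874"))
}
```

Note that trim strips both ends while rtrim strips only the right; for a hex string with no leading whitespace the result is the same.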
Details:
When Spark performs ".show", each column is passed through a new internal expression, ToPrettyString, which, by the default inherited from ToStringBase, translates binary into hex (via SparkStringUtils.getHexString) and wraps the result in square brackets:
bytes.map("%02X".format(_)).mkString("[", " ", "]")
Cast overrides this default:
override protected def useHexFormatForBinary: Boolean = false
The hex SQL function calls Hex.hex:
def hex(bytes: Array[Byte]): UTF8String = {
  val length = bytes.length
  val value = new Array[Byte](length * 2)
  var i = 0
  while (i < length) {
    value(i * 2) = Hex.hexDigits((bytes(i) & 0xF0) >> 4)
    value(i * 2 + 1) = Hex.hexDigits(bytes(i) & 0x0F)
    i += 1
  }
  UTF8String.fromBytes(value)
}
You could also call this function from a UDF, and if performance is important you probably should do that instead of using a regex.
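As a rough sketch of the UDF route, the function below produces the spaced hex string directly from the bytes, using the same nibble lookup as the loop above but writing a space between byte pairs. The names SpacedHex and toSpacedHex are hypothetical, and it uses its own digit table instead of Spark's internal Hex.hexDigits, so it is an assumption-laden illustration and not Spark's implementation.

```scala
// Hypothetical helper, not part of Spark: bytes -> "75 00 01 00 4C 12 10".
object SpacedHex {
  private val hexDigits: Array[Char] = "0123456789ABCDEF".toCharArray

  def toSpacedHex(bytes: Array[Byte]): String = {
    if (bytes == null) return null // keep nulls as nulls for nullable columns
    if (bytes.isEmpty) return ""
    // Two hex characters per byte plus one space between neighbours.
    val out = new Array[Char](bytes.length * 3 - 1)
    var i = 0
    while (i < bytes.length) {
      val b = bytes(i)
      out(i * 3) = hexDigits((b & 0xF0) >> 4) // high nibble
      out(i * 3 + 1) = hexDigits(b & 0x0F)    // low nibble
      if (i < bytes.length - 1) out(i * 3 + 2) = ' '
      i += 1
    }
    new String(out)
  }
}

object SpacedHexDemo extends App {
  // Bytes taken from the REC value in the question.
  println(SpacedHex.toSpacedHex(Array[Byte](0x75, 0x00, 0x01, 0x00, 0x4C, 0x12, 0x10)))
}
```

In Spark this could then be wrapped with udf from org.apache.spark.sql.functions, for example val spacedHex = udf(SpacedHex.toSpacedHex _), and applied as res1.select(spacedHex(col("REC"))). Whether it actually beats the regex on your data is something to measure, since UDFs carry their own serialization overhead.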