I am using the following code to read Avro in Spark:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

val inputData = sc.hadoopFile(inputPath,
  classOf[AvroInputFormat[GenericRecord]],
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable]).map { t =>
  val genericRecord = t._1.datum()
  genericRecord.get("name").asInstanceOf[String]
}
The loading part works fine, but the conversion to String fails:
Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be cast to java.lang.String
To simplify the example, I reduced it to a single line:

genericRecord.get("name").asInstanceOf[String]

That line actually comes from a library (where it is the Java cast (String) genericRecord.get("name")), and it works fine in a Hadoop MapReduce job. However, when I use the same library from Spark, it fails with the exception above.
I know I can change the code to genericRecord.get("name").toString() to make it work, but since the same code runs fine in the Hadoop MapReduce job, I am hoping the Utf8 values can be converted to String automatically, so that I don't have to touch all of the code logic.
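Another workaround would be a small helper that rewrites every top-level Utf8 field of a record as a String before handing it to the library. This is only a rough sketch (the helper name is mine; it does not recurse into nested records, maps, or arrays), and it still means changing the pipeline:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.util.Utf8
import scala.collection.JavaConverters._

// Replace each top-level Utf8 value with its String form in place.
// GenericRecord is mutable, so put() overwrites the field.
def utf8FieldsToString(record: GenericRecord): GenericRecord = {
  record.getSchema.getFields.asScala.foreach { field =>
    record.get(field.pos()) match {
      case u: Utf8 => record.put(field.pos(), u.toString)
      case _ => // leave non-string values as they are
    }
  }
  record
}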
To sum up: how can I make every org.apache.avro.util.Utf8 in a GenericRecord be converted to java.lang.String automatically?
It looks like the solution is to use AvroKey instead of AvroWrapper. The code below works: every org.apache.avro.util.Utf8 is automatically converted to java.lang.String, and there is no exception any more.
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

val inputData = sc.newAPIHadoopFile(inputPath,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable]).map { t =>
  val genericRecord = t._1.datum()
  genericRecord.get("name").asInstanceOf[String]
}
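As a quick sanity check (assuming a small test file at inputPath with a string field named "name"), the values now print as plain strings:

inputData.take(5).foreach(println) // plain java.lang.String values, no Utf8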