Tags: hadoop, apache-spark, avro

Using Spark to read Avro data throws an "org.apache.avro.util.Utf8 cannot be cast to java.lang.String" exception


I am using the following code to read Avro data in Spark:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

val inputData = sc.hadoopFile(inputPath,
  classOf[AvroInputFormat[GenericRecord]],
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable]).map { t =>
    val genericRecord = t._1.datum() // unwrap the GenericRecord from the AvroWrapper
    genericRecord.get("name").asInstanceOf[String]
  }

The loading part works fine, but the conversion to String fails:

Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be cast to java.lang.String

To simplify the example, I use the single line

    genericRecord.get("name").asInstanceOf[String]

That part actually comes from a library that is used without problems in a Hadoop MapReduce job. However, when I use the same library in Spark, it fails with the exception above.

I know I can change the code to genericRecord.get("name").toString() to make it work, but since the library already runs fine in another Hadoop MapReduce job, I am hoping every Utf8 can be converted to String automatically so that I don't need to change the code logic everywhere.
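For reference, here is a minimal sketch of that per-field workaround (the helper name nameOf is illustrative, not something from the library):

    import org.apache.avro.generic.GenericRecord

    // org.apache.avro.util.Utf8 implements CharSequence, so toString
    // always yields a java.lang.String without any cast.
    def nameOf(record: GenericRecord): String =
      record.get("name").toString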

To sum up: how can every org.apache.avro.util.Utf8 in a GenericRecord be converted to java.lang.String automatically?


Solution

  • It looks like the solution is to use AvroKey instead of AvroWrapper. The code below works: every org.apache.avro.util.Utf8 is converted to java.lang.String automatically, and the exception no longer occurs.

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    val inputData = sc.newAPIHadoopFile(inputPath,
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable]).map { t =>
        val genericRecord = t._1.datum() // unwrap the GenericRecord from the AvroKey
        genericRecord.get("name").asInstanceOf[String] // now a java.lang.String
      }
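As a quick sanity check (a sketch reusing the inputData RDD defined above), printing a few values should now show plain strings instead of raising the ClassCastException:

    // Each element is already a java.lang.String after the map above,
    // so this should print names without any cast error.
    inputData.take(5).foreach(println)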