apache-spark, hive, apache-spark-sql, spark-avro

Create Hive external table with schema in Spark


I am using Spark 1.6 and I want to create an external Hive table, just like I would in a Hive script. To do this, I first read in the partitioned Avro file and get its schema. This is where I am stuck: I have no idea how to apply this schema to the table I am creating. I use Scala. Need help, guys.


Solution

  • Finally, I made it myself the old-fashioned way, with the help of the code below:

    import com.databricks.spark.avro._

    // Read the partitioned Avro data and grab its schema
    val rawSchema = sqlContext.read.avro("Path").schema

    // Build the DDL column list: strip a leading underscore from each column name
    // (Hive rejects names starting with _) and map Spark's "integer" to Hive's "int"
    val schemaString = rawSchema.fields
      .map(field => field.name.replaceAll("""^_""", "") + " " + (field.dataType.typeName match {
        case "integer" => "int"
        case other     => other
      }))
      .mkString(",\n")

    val ddl =
      s"""
         |CREATE EXTERNAL TABLE $tablename ($schemaString)
         |PARTITIONED BY (y int, m int, d int, hh int, mm int)
         |STORED AS AVRO
         |-- inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
         |-- outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
         |LOCATION 'hdfs://$path'
       """.stripMargin
    
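    To actually create the table and register the partitions that already sit on HDFS, you can run the generated DDL through a HiveContext and then repair the metastore. A minimal sketch, assuming sqlContext is a HiveContext and the partition directories follow the key=value layout (.../y=2016/m=1/d=1/hh=0/mm=0):

    // Run the generated DDL; sqlContext must be a HiveContext for Hive DDL to work
    sqlContext.sql(ddl)

    // MSCK REPAIR TABLE scans the table location and adds any partition
    // directories it finds (key=value style) to the metastore
    sqlContext.sql(s"MSCK REPAIR TABLE $tablename")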

    Take care: no column name can start with _, and Hive can't parse "integer" (hence the mapping to "int"). I would say this way is not flexible, but it works. If anyone has a better idea, please comment.
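
    If you want something a bit less manual, one possible sketch (untested here) is to rely on Spark's DataType.simpleString, which already produces Hive-style type names such as int and bigint, so only the leading underscore in column names still needs handling; nested or exotic types may still deserve a manual check:

    // Alternative sketch: simpleString yields Hive-friendly type names
    // ("int", "bigint", "string", ...), so no hand-written integer -> int mapping
    val schemaString = rawSchema.fields
      .map(f => f.name.stripPrefix("_") + " " + f.dataType.simpleString)
      .mkString(",\n")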