hadoop, apache-spark, hive, apache-spark-sql

Spark SQL throwing error "java.lang.UnsupportedOperationException: Unknown field type: void"


I am getting the error below in Spark (1.6) SQL while creating a table with a column whose value defaults to NULL. For example:

    create table test as select column_a, NULL as column_b from test_temp;

The same statement works in Hive, where it creates the column with the data type "void".

To avoid the exception, I am currently using an empty string instead of NULL, which gives the new column a string data type.
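For reference, that workaround looks roughly like this (a sketch of the same CTAS with an empty string standing in for NULL):

    create table test as select column_a, '' as column_b from test_temp;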

Is there a better way to insert null values into a Hive table using Spark SQL?

2017-12-26 07:27:59 ERROR StandardImsLogger$:177 - org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Unknown field type: void
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:789)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:746)
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:428)
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:426)
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:426)
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:293)
    at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:239)
    at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:238)
    at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:281)
    at org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:426)
    at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute$1(CreateTableAsSelect.scala:72)
    at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$1(CreateTableAsSelect.scala:47)
    at org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:89)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:56)
    at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:153)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:829)

Solution

  • I couldn't find much information regarding the data type void, but it looks like it is somewhat equivalent to the Any data type we have in Scala.

    The table at the end of this page explains that a void can be cast to any other data type (illustrated in the sketches below).

    Here are some JIRA issues that are somewhat similar to the problem you are facing:

    So, as mentioned in the comment, instead of a bare NULL you can explicitly cast it to any of the concrete data types:

    select cast(NULL as string) as column_b
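
    Applied to the original CTAS statement from the question, the fixed statement would look something like this:

    create table test as
    select column_a, cast(NULL as string) as column_b
    from test_temp;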
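
    And since a void can be cast to any other data type (per the casting table referenced above), the same pattern works for other types as well; a sketch, where the column aliases are made up for the example:

    select cast(NULL as string) as string_col,
           cast(NULL as int)    as int_col,
           cast(NULL as double) as double_col
    from test_temp;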