apache-spark · hive · apache-spark-sql · parquet

Read a Hive table and transform it to a Parquet table


The data comes from a Hive table. To be more precise, the source table has the following properties:

Serde Library   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat     org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

This table should be transformed to Parquet so that it has the following properties:

Serde Library   org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat     org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

The following Scala Spark code is executed:

val df = spark.sql("SELECT * FROM table")
df.write.format("parquet").mode("append").saveAsTable("table")

This still results in the unwanted properties:

Serde Library   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat     org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Hopefully somebody can help me.


Solution

  • You cannot mix different file formats in the same table, nor can you change the file format of a table that already contains data. (To be more precise, you can do these things, but neither Hive nor Spark will be able to read data whose format does not match the table's metadata.)

    You should write the data to a new table, make sure that it matches your expectations, then rename or remove the old table and finally rename the new table to the old name. For example:

    CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM orig_table;
    ALTER TABLE orig_table RENAME TO orig_table_backup;
    ALTER TABLE new_table RENAME TO orig_table;
    

    You can execute these SQL statements directly in a Hive session, or from Spark using spark.sql(...) calls, one statement per call, as sketched below.
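
    A minimal sketch of the Spark route, assuming a SparkSession named spark that was built with enableHiveSupport() and the table names from the statements above (spark.sql accepts only a single statement at a time):

    // Rewrite the data into a new Parquet-backed table.
    spark.sql("CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM orig_table")

    // Optional sanity check before swapping names: the SerDe and the
    // input/output formats of the new table should now be the Parquet ones.
    spark.sql("DESCRIBE FORMATTED new_table").show(100, truncate = false)

    // Swap the tables: keep the original as a backup, promote the new one.
    spark.sql("ALTER TABLE orig_table RENAME TO orig_table_backup")
    spark.sql("ALTER TABLE new_table RENAME TO orig_table")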