Using Spark 1.5.2 and Hive 1.2, I have an external Hive table in parquet format. I created a .py script that selects from my_table into a DataFrame, does some transforms, and then attempts to write back into the original table.
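For context, the read side looks roughly like this (a sketch only: the HiveContext setup and variable names are assumed, and only the table name my_table comes from the description above):
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

# Load the external Hive table into a DataFrame, then apply the transforms.
df = sqlContext.table('my_table')
# ... transforms on df ...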
I've tried the following methods:
df.write.insertInto('table_name', overwrite='true')
This throws the following error:
pyspark.sql.utils.AnalysisException: Cannot insert overwrite into table that is also being read from.
df.write.mode('overwrite').parquet('my_path')
df.write.parquet('my_path', mode='overwrite')
df.write.save('my_path', format='parquet', mode='overwrite')
These all seem to throw this error:
ERROR Client fs/client/fileclient/cc/client.cc:1802 Thread: 620 Open failed for file /my_path/part-r-00084-9, LookupFid error No such file or directory(2)
2016-04-26 16:47:17,0942 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:2488 Thread: 620 getBlockInfo failed, Could not open file /my_path/part-r-00084-9
16/04/26 16:47:17 WARN DAGScheduler: Creating new stage failed due to exception - job: 16
Note that method 1 above works fine if the file format is ORC, but throws that error for parquet.
Any suggestions would be greatly appreciated!
From everything I've found thus far, the solution for reading from and writing back to the same parquet-formatted table seems to be to write to a temporary/staging directory, delete the original directory, and then rename the temporary directory to the original. To do this in PySpark you will need the following commands:
import os
import shutil

# Delete the original table directory, then move the staging directory into its place.
shutil.rmtree('my_path')
os.rename('my_tmp_path', 'my_path')
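Putting the whole flow together, a minimal sketch might look like the following. The paths, the HiveContext setup, and the placeholder transform are assumptions for illustration; using os/shutil also assumes the table's data directory is reachable on the local filesystem (e.g. an NFS-mounted MapR-FS path), otherwise the delete/rename would need to go through hadoop fs instead.
from pyspark import SparkContext
from pyspark.sql import HiveContext
import os
import shutil

sc = SparkContext()
sqlContext = HiveContext(sc)

table_path = '/path/to/my_table'        # hypothetical location of the external table's data
staging_path = '/path/to/my_table_tmp'  # hypothetical staging directory on the same filesystem

# Read the Hive table, transform, and write the result to the staging
# directory. Writing to a different path avoids reading from and
# overwriting the same location in one job.
df = sqlContext.table('my_table')
transformed = df  # replace with the real transformations
transformed.write.parquet(staging_path, mode='overwrite')

# Swap the directories: delete the original data directory, then move the
# staging directory into its place.
shutil.rmtree(table_path)
os.rename(staging_path, table_path)
Since the table is external and the data ends up back at the original location, the table metadata should not need to change; a partitioned table would additionally need its partitions refreshed.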