Search code examples
hivehql

Hive query fails with file not found error that should not be possible


I have a hive external table that is partitioned by the date and time it was inserted, say for example 20200331_0505 which is in YYYYMMDD_HHMM format.

Currently there is only one partition:

> hdfs dfs -ls /path/to/external/table    
-rw-r----- 2020-03-31 05:06 /path/to/external/table/_SUCCESS  
drwxr-x--- 2020-03-31 05:06 /path/to/external/table/loaddate=20200331_0505  

And if I run a hive query to find the partitions:

select distinct loaddate from table;  
+----------------+
|    loaddate    |
+----------------+
| 20200331_0505  |
+----------------+

That is expected and what I want to see, but if I run this:

select * from table where loaddate=(select max(loaddate) from table);
Then I get this error:

ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 3, vertexId=vertex_1585179445264_14095_4_00, diagnostics=[Vertex vertex_1585179445264_14095_4_00 [Map 3] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: <Table> initializer failed, vertex=vertex_1585179445264_14095_4_00 [Map 3], java.lang.RuntimeException: ORC split generation failed with exception: java.io.FileNotFoundException: File hdfs://path/to/external/table/loaddate=20200327_0513 does not exist.
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)

So it is trying to load a partition that does not exist, 20200327_0513, what could be causing this?


Solution

  • When you delete partitions either directly with the rm command or with something like a SaveMode.Overwrite write command, it does not alert hive to the changes in partitions, so you need to tell hive that the partitions have changed. There are many ways to do this, the way I chose to fix it was:

    msck repair table <table> sync partitions