Search code examples
hivehdfsorc

How to retrieve only the latest version of the records from multiple ORC files (via a Hive table) which contain multiple versions for each record?


I have the below mentioned orc files for one table called EXAM_RESULTS. This table firstly had an initial load and then onward there would be incremental loads. The new records coming via the incremental loads might bring new records into the EXAM_RESULTS table or updated versions of the existing records.

The first four ORC files with files ending as part-m-00000 to 00003 are from the initial load and the other are from different incremental loads performed at different days and times are can be seen in the date time information in the filename.

-rw-r--r--   3 mysql hdfs      15808 2019-08-19 11:55 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_11-55-49.355.orc
-rw-r--r--   3 mysql hdfs      22068 2019-08-19 12:33 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_12-33-37.642.orc
-rw-r--r--   3 mysql hdfs      15714 2019-08-19 12:42 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_12-42-10.203.orc
-rw-r--r--   3 mysql hdfs      20297 2019-08-19 12:52 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_12-52-08.658.orc
-rw-r--r--   3 mysql hdfs      17082 2019-08-19 18:11 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_18-11-53.338.orc
-rw-r--r--   3 mysql hdfs      14446 2019-08-20 07:59 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-20_07-59-31.911.orc
-rw-r--r--   3 mysql hdfs      27128 2019-08-20 12:55 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-20_12-55-10.985.orc
-rw-r--r--   3 hdfs   hdfs   18652678 2019-07-30 14:37 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/part-m-00000
-rw-r--r--   3 hdfs   hdfs   22259778 2019-07-30 14:37 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/part-m-00001
-rw-r--r--   3 hdfs   hdfs   16816786 2019-07-30 14:37 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/part-m-00002
-rw-r--r--   3 hdfs   hdfs   18608955 2019-07-30 14:37 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/part-m-00003

Now, I want to query the records from this table in such a way that my SELECT query outputs the latest version of each record.

For exmaple: The existing table shows the below info (and would also be the output of a SELECT * FROM TEST_MARKS.EXAM_RESULTS query):

Student_Name    marks_sub1  marks_sub2  marks_sub3  marks_sub4  total_marks avg_marks
 ABC            67          89          76          93          325         81.25
 XYZ            23          14          80          56          173         43.25
 PQR            50          67          98          44          259         64.75
 ABC            67          92          78          93          329         82.25
 XYZ            88          61          76          56          281         70.25

Here, the students ABC and XYZ had their marks updated whereas PQR had no change in the data - these updated records for ABC and XYZ were inserted into the table via the incremental loads.

What I'm trying to achieve is I want to have a SELECT query should shows only the latest records for each record. For example, the result of my query should show the below data:

Student_Name    marks_sub1  marks_sub2  marks_sub3  marks_sub4  total_marks avg_marks
 ABC            67          92          78          93          329         82.25
 XYZ            88          61          76          56          281         70.25
 PQR            50          67          98          44          259         64.75

I have no idea how to do this as there is no column within my table itself that I can use to filter the latest data. I need to use the date and time information present in the filename in order to do this filtering in my SELECT query.

I'm totally new to Hive and HDFS and don't know how to achieve this.

Can someone please help me?


Solution

  • You can use INPUT__FILE__NAME virtual column in Hive and take record with greatest filename.

    New files like this: EXAM_RESULTS_2019-08-20_12-55-10.985.orc contain date in a sortable format and the whole filename can be sorted. Old files like this part-m-00000 are also sortable but the problem is that when comparing with new files 'part-m-00000'<'EXAM_RESULTS_2019-08-20_12-55-10.985.orc' = false. I propose to remove part-m- from the filename to make it sortable in a proper way, so '00000'< 'EXAM_RESULTS_2019-08-20_12-55-10.985.orc'. Then it is possible to use row_number to find latest record:

    select Student_Name    marks_sub1  marks_sub2  marks_sub3  marks_sub4  total_marks avg_marks
    from 
    (
    select e.*, row_number() over(partition by Student_Name order by filename desc) rn
    from
    (select regexp_replace(INPUT__FILE__NAME,'part-m-','') filename, e.* from TEST_MARKS.EXAM_RESULTS e )e
    )s
    where rn=1;