I have the below mentioned orc files for one table called EXAM_RESULTS.
This table firstly had an initial load and then onward there would be incremental loads. The new records coming via the incremental loads might bring new records into the EXAM_RESULTS table or updated versions of the existing records.
The first four ORC
files with files ending as part-m-00000
to 00003
are from the initial load and the other are from different incremental loads performed at different days and times are can be seen in the date time information in the filename.
-rw-r--r-- 3 mysql hdfs 15808 2019-08-19 11:55 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_11-55-49.355.orc
-rw-r--r-- 3 mysql hdfs 22068 2019-08-19 12:33 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_12-33-37.642.orc
-rw-r--r-- 3 mysql hdfs 15714 2019-08-19 12:42 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_12-42-10.203.orc
-rw-r--r-- 3 mysql hdfs 20297 2019-08-19 12:52 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_12-52-08.658.orc
-rw-r--r-- 3 mysql hdfs 17082 2019-08-19 18:11 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-19_18-11-53.338.orc
-rw-r--r-- 3 mysql hdfs 14446 2019-08-20 07:59 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-20_07-59-31.911.orc
-rw-r--r-- 3 mysql hdfs 27128 2019-08-20 12:55 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/EXAM_RESULTS_2019-08-20_12-55-10.985.orc
-rw-r--r-- 3 hdfs hdfs 18652678 2019-07-30 14:37 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/part-m-00000
-rw-r--r-- 3 hdfs hdfs 22259778 2019-07-30 14:37 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/part-m-00001
-rw-r--r-- 3 hdfs hdfs 16816786 2019-07-30 14:37 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/part-m-00002
-rw-r--r-- 3 hdfs hdfs 18608955 2019-07-30 14:37 /INFO/CLASSROOM/TEST_MARKS/EXAM_RESULTS/part-m-00003
Now, I want to query the records from this table in such a way that my SELECT
query outputs the latest version of each record.
For exmaple:
The existing table shows the below info (and would also be the output of a SELECT * FROM TEST_MARKS.EXAM_RESULTS
query):
Student_Name marks_sub1 marks_sub2 marks_sub3 marks_sub4 total_marks avg_marks
ABC 67 89 76 93 325 81.25
XYZ 23 14 80 56 173 43.25
PQR 50 67 98 44 259 64.75
ABC 67 92 78 93 329 82.25
XYZ 88 61 76 56 281 70.25
Here, the students ABC
and XYZ
had their marks updated whereas PQR
had no change in the data - these updated records for ABC
and XYZ
were inserted into the table via the incremental loads.
What I'm trying to achieve is I want to have a SELECT
query should shows only the latest records for each record. For example, the result of my query should show the below data:
Student_Name marks_sub1 marks_sub2 marks_sub3 marks_sub4 total_marks avg_marks
ABC 67 92 78 93 329 82.25
XYZ 88 61 76 56 281 70.25
PQR 50 67 98 44 259 64.75
I have no idea how to do this as there is no column within my table itself that I can use to filter the latest data. I need to use the date and time information present in the filename in order to do this filtering in my SELECT
query.
I'm totally new to Hive and HDFS and don't know how to achieve this.
Can someone please help me?
You can use INPUT__FILE__NAME virtual column in Hive and take record with greatest filename.
New files like this: EXAM_RESULTS_2019-08-20_12-55-10.985.orc
contain date in a sortable format and the whole filename can be sorted.
Old files like this part-m-00000
are also sortable but the problem is that when comparing with new files 'part-m-00000'<'EXAM_RESULTS_2019-08-20_12-55-10.985.orc' = false
. I propose to remove part-m-
from the filename to make it sortable in a proper way, so '00000'< 'EXAM_RESULTS_2019-08-20_12-55-10.985.orc'
. Then it is possible to use row_number
to find latest record:
select Student_Name marks_sub1 marks_sub2 marks_sub3 marks_sub4 total_marks avg_marks
from
(
select e.*, row_number() over(partition by Student_Name order by filename desc) rn
from
(select regexp_replace(INPUT__FILE__NAME,'part-m-','') filename, e.* from TEST_MARKS.EXAM_RESULTS e )e
)s
where rn=1;