I have a requirement I want to meet. I need to Sqoop data from a DB into Hive. I run the Sqoop import daily since this data is updated daily.
This data will be used as lookup data by a Spark consumer for enrichment. We want to keep a history of all the data we have received, but for lookup we only need the latest data (same day), not all of it. I was thinking of creating a Hive view over the historical table that only shows records inserted that day. Is there a way to automate the view on a daily basis so that the view query will always have the latest data?
Q: Is there a way to automate the view on a daily basis so that the view query will always have the latest data?
There is no need to update or automate anything if the table is partitioned by date.
Q: We want to keep a history of all the data we have received but we don't need all the data for lookup only the latest data (same day).
NOTE: Whether you use a Hive view or a Hive table, you should always avoid a full table scan when fetching the data of the latest partition.
Option 1: Hive approach to query data
If you want to go with the Hive approach, you need a partition column, for example partition_date, and a table in Hive partitioned on that column:
select * from db.yourpartitionedTable where partition_date in
(select max(partition_date) from db.yourpartitionedTable)
or
select * from (select *,dense_rank() over (order by partition_date desc) dt_rnk from db.yourpartitionedTable ) myview
where myview.dt_rnk=1
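Either query always returns the latest partition and its data from the partitioned table: if today's date is already present as a partition you get today's data, otherwise you get the partition with the highest partition_date. Note that the dense_rank() form has to rank every row, so it reads the whole table, which is what the note above advises against.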
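Since the question asks about a view, here is a minimal sketch of how the first query could be wrapped in one. The view is defined once and its subquery is resolved each time the view is read, so nothing has to be regenerated daily. The names `LatestPartitionView`, `ddl` and `db.latest_lookup` are hypothetical, and how efficiently the subquery is evaluated depends on your Hive or Spark version and settings.

```scala
// Sketch only: LatestPartitionView, ddl and the view name are hypothetical.
object LatestPartitionView {
  // Builds the DDL for a view that exposes only the rows whose partition
  // column equals the highest value currently present in the table.
  def ddl(view: String, table: String, column: String): String =
    s"""CREATE VIEW IF NOT EXISTS $view AS
       |SELECT * FROM $table
       |WHERE $column IN (SELECT max($column) FROM $table)""".stripMargin
}

// Possible usage from a Spark job (assumes an existing SparkSession `spark`):
// spark.sql(LatestPartitionView.ddl("db.latest_lookup", "db.yourpartitionedTable", "partition_date"))
```

The generated statement can also be pasted into a Hive session instead of being run through Spark.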
Option 2: Plain Spark approach to query data
Use the show partitions command from Spark, i.e. spark.sql(s"show partitions $yourpartitionedtablename").
Collect the result into an array and sort it to get the latest partition date. With that value, your Spark component can query only the latest partition as lookup data.
see my answer as an idea for getting latest partition date.
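As a rough sketch of option 2, the helper below picks the latest value out of the strings that show partitions returns. It assumes the partition values are ISO dates (yyyy-MM-dd), which sort correctly as plain strings; `LatestPartition` and `latest` are hypothetical names, not part of Spark.

```scala
// Sketch only: LatestPartition and latest are hypothetical helper names.
object LatestPartition {
  // `show partitions` yields one string per partition, e.g.
  // "partition_date=2024-01-03", or "partition_date=2024-01-03/region=eu"
  // for a multi-level partitioned table.
  def latest(partitions: Seq[String], column: String): Option[String] = {
    val prefix = column + "="
    val values = partitions.flatMap { spec =>
      spec.split("/").find(part => part.startsWith(prefix)).map(part => part.stripPrefix(prefix))
    }
    // Assumes yyyy-MM-dd values, so the string maximum is the latest date.
    if (values.isEmpty) None else Some(values.max)
  }
}

// Possible usage (assumes an existing SparkSession `spark`):
// val rows = spark.sql(s"show partitions $yourpartitionedtablename").collect().map(_.getString(0))
// LatestPartition.latest(rows, "partition_date").foreach { d =>
//   val lookup = spark.table(yourpartitionedtablename).where(s"partition_date = '$d'")
// }
```

Filtering on the partition column with a literal value lets Spark read only that partition's files.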
I prefer option 2: no Hive query is needed, and there is no full table scan because show partitions only reads partition metadata. That avoids the performance bottleneck and keeps the lookup fast.
Another idea is to query the metastore directly with HiveMetastoreClient
or with option2... see this and my answer and the other