I'm currently having an issue with my Hive database. I have two external tables, abc and xyz, which use a JSON SerDe over an HDFS directory. I need to read those tables with Power BI, and because of the number of JSON files I want to remove redundant rows in one of the two tables. I had two ideas, but don't know how to implement them:
create table abc_distinct as
select distinct * from abc;
This works fine, but how can I keep the abc_distinct table up to date, since the external table gets new files every day?
Additional information: My data comes from Apache NiFi.
Thank you in advance!
I don't think that Hive alone will provide a robust approach to your problem; you need a proper design for this kind of use case. I would start with something like the below.
Two zones on HDFS: a Raw/Landing zone, where NiFi drops the incoming JSON files, and a Cleaned zone, which holds the deduplicated data that Hive exposes to Power BI.
Maintain NiFi ingestion logging in a table in Postgres or MySQL for the Raw/Landing zone, and drive the ETL process (Spark or Sqoop) from this logging table. Put your cleanup/dedup logic inside the ETL process: it compares the new Raw/Landing zone data with the existing data in the Cleaned zone and loads only the difference. You can build a hash for each record/row for the comparison, and you will need partitioning for performance.
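As a rough illustration of that dedup step, here is a minimal PySpark sketch. The paths, the Parquet format of the Cleaned zone, the hash-over-all-columns idea and the ingest_date partition column are assumptions, not something taken from your setup:

# Rough PySpark sketch of the dedup step; paths, formats and column names
# are placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("abc-dedup")
         .getOrCreate())

# New JSON files that NiFi dropped into the landing zone (path is an assumption).
landing = spark.read.json("/data/landing/abc/")

# Existing deduplicated data in the cleaned zone (assumes it already exists;
# the very first run has to be handled separately).
cleaned = spark.read.parquet("/data/cleaned/abc/")

# Hash the business columns so rows can be compared cheaply.
# (concat_ws skips NULLs; coalesce to a sentinel value first if that matters.)
business_cols = landing.columns

def with_hash(df):
    return df.withColumn(
        "row_hash",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in business_cols]), 256))

landing_h = with_hash(landing).dropDuplicates(["row_hash"])
cleaned_h = with_hash(cleaned)

# Keep only rows whose hash is not yet present in the cleaned zone.
new_rows = landing_h.join(cleaned_h.select("row_hash"), "row_hash", "left_anti")

# Append the difference, partitioned by ingestion date for performance.
(new_rows.drop("row_hash")
         .withColumn("ingest_date", F.current_date())
         .write.mode("append")
         .partitionBy("ingest_date")
         .parquet("/data/cleaned/abc/"))

An external Hive table on top of /data/cleaned/abc/ (refreshed with MSCK REPAIR TABLE after each run) is one way to make the result visible to Power BI.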
For the logging table implementation -
When NiFi successfully puts a file on the HDFS Raw/Landing zone, create a record and set a status like 'READY_FOR_DEDUP'. The ETL process picks up only the files whose status is 'READY_FOR_DEDUP'; once dedup/cleanup is done and the result has been written to the HDFS Cleaned zone, the process sets the status of the same record to 'COMPLETED' (or 'FAILED'), so that such files are not picked up again in the next batch run.
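A rough sketch of that logging table and the status flow, assuming PostgreSQL accessed with psycopg2 (the table name, the columns and the idea of letting NiFi register files via a PutSQL processor are placeholders):

# Sketch of the ingestion-logging table and its status transitions.
# Assumes PostgreSQL + psycopg2; names are made up for illustration.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS nifi_ingestion_log (
    id          BIGSERIAL PRIMARY KEY,
    hdfs_path   TEXT NOT NULL,
    created_at  TIMESTAMP DEFAULT now(),
    status      TEXT NOT NULL  -- READY_FOR_DEDUP | COMPLETED | FAILED
);
"""

conn = psycopg2.connect(host="dbhost", dbname="etl_meta",
                        user="etl", password="changeme")
cur = conn.cursor()
cur.execute(DDL)

# 1) NiFi (e.g. a PutSQL processor after PutHDFS) registers each new file:
cur.execute(
    "INSERT INTO nifi_ingestion_log (hdfs_path, status) VALUES (%s, %s)",
    ("/data/landing/abc/file-0001.json", "READY_FOR_DEDUP"))

# 2) The ETL job picks up only the files that are ready:
cur.execute(
    "SELECT id, hdfs_path FROM nifi_ingestion_log WHERE status = %s",
    ("READY_FOR_DEDUP",))
ready_files = cur.fetchall()

# 3) After the dedup/load step, mark each file so it is not reprocessed:
for file_id, _path in ready_files:
    cur.execute(
        "UPDATE nifi_ingestion_log SET status = %s WHERE id = %s",
        ("COMPLETED", file_id))  # or 'FAILED' on error

conn.commit()
cur.close()
conn.close()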
I would suggest writing a Spark application (like the sketch above) to perform the ETL task, so that you can schedule the spark-submit execution using cron, Airflow or Oozie.
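With Airflow, a minimal DAG could look like the following (the DAG id, schedule and script path are placeholders; with plain cron, a single crontab line running the same spark-submit command would do the same job):

# Minimal Airflow DAG that runs the Spark ETL once a day via spark-submit.
# DAG id, schedule and script path are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="abc_dedup_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_dedup = BashOperator(
        task_id="spark_submit_dedup",
        bash_command=("spark-submit --master yarn --deploy-mode cluster "
                      "/opt/etl/abc_dedup.py"))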