Tags: hive, powerbi, apache-nifi

Keeping table up to date - Hive


I'm currently having an issue with my Hive database. I have two external tables, abc and xyz, which use a JSON SerDe and are backed by an HDFS directory. I need to read those tables with Power BI, and because of the number of JSON files I want to remove redundant rows in one of the two tables. I have two ideas, but don't know how to implement them:

  1. Create a table with:
create table abc_distinct as
select distinct * from abc;

This works fine, but how can I keep the abc_distinct table up to date, since the external table gets new files every day? (For reference, the refresh I would need to automate is sketched after this list.)

  2. Remove data redundancy in the original table with the SerDe? I found nothing on how to eliminate data redundancy within a SerDe-based Hive table.
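
For completeness, the refresh of idea 1 that I would need to schedule would look something like this (a full rebuild each time, assuming that is acceptable for my data volume):

-- rebuild the deduplicated copy from the external table after new files arrive
insert overwrite table abc_distinct
select distinct * from abc;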

Additional information: My data comes from Apache NiFi.

Thank you in advance!


Solution

  • I don't think Hive alone will give you a robust approach to this problem. You need a proper design for such a use case. I would start with something like the following.

    Define two zones on HDFS (a DDL sketch follows this list):

    1. Raw/Landing zone for the JSON/AVRO files ingested by NiFi, accessible through Hive external tables.
    2. Cleaned zone (Hive managed ORC tables), which acts as the source for Power BI and is populated by an ETL process that consumes the Raw/Landing zone.
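
    As a minimal sketch of the two zones in HiveQL (table names, columns, locations and the SerDe class are illustrative assumptions; use whatever JSON SerDe you already have):

    -- Raw/Landing zone: external table over the directory NiFi writes to
    create external table raw_abc (
      id string,
      payload string
    )
    row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
    location '/data/raw/abc';

    -- Cleaned zone: managed ORC table, partitioned for incremental loads,
    -- with a per-row hash used later for deduplication
    create table cleaned_abc (
      id string,
      payload string,
      row_hash string
    )
    partitioned by (load_date string)
    stored as orc;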

    Maintain a NiFi ingestion log (a table) in Postgres or MySQL for the Raw/Landing zone; the ETL process (Spark or Sqoop) is driven by this logging table. Put your cleanup/dedup logic inside the ETL process: it compares the new Raw/Landing zone data with the existing data in the Cleaned zone and loads only the difference. You can build a hash for each record/row for the comparison, and you will need to implement partitioning for performance.
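
    One possible shape of that dedup step, written in HiveQL against the sketch tables above (md5 over the concatenated columns is just an illustrative row hash; a Spark job would express the same anti-join on the hash):

    -- load only rows whose hash is not yet present in the Cleaned zone;
    -- the partition value would come from the current batch run
    insert into table cleaned_abc partition (load_date = '2019-01-01')
    select distinct
           r.id,
           r.payload,
           md5(concat_ws('|', r.id, r.payload)) as row_hash
    from raw_abc r
    left outer join cleaned_abc c
      on c.row_hash = md5(concat_ws('|', r.id, r.payload))
    where c.row_hash is null;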

    For the logging table implementation:

    When NiFi successfully puts a file into the HDFS Raw/Landing zone, create a record and set a status such as 'READY_FOR_DEDUP'. The ETL process will pick up only the files whose status is 'READY_FOR_DEDUP'; once the dedup/cleanup is done and the result has been written to the HDFS Cleaned zone, the process sets the status of that record to 'COMPLETED' or 'FAILED', so that those files are not picked up again in the next batch run.
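
    A minimal sketch of such a logging table and its status transitions in plain SQL (the table and column names are assumptions; the insert would be issued by NiFi, the select/update by the ETL process):

    -- one row per file landed by NiFi on the Raw/Landing zone
    create table ingestion_log (
      id        bigint primary key,
      hdfs_path varchar(1024) not null,
      landed_at timestamp not null,
      status    varchar(20) not null  -- 'READY_FOR_DEDUP' / 'COMPLETED' / 'FAILED'
    );

    -- NiFi: register a newly landed file
    insert into ingestion_log (id, hdfs_path, landed_at, status)
    values (1, '/data/raw/abc/part-0001.json', current_timestamp, 'READY_FOR_DEDUP');

    -- ETL: pick up pending files, then record the outcome of the run
    select hdfs_path from ingestion_log where status = 'READY_FOR_DEDUP';
    update ingestion_log set status = 'COMPLETED' where id = 1;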

    I would write a Spark application to perform the ETL task, so that you can schedule the spark-submit execution with cron, Airflow or Oozie.