Tags: hadoop, hive, reporting, updates, business-intelligence

How to manage modified data in Apache Hive


We are working on Cloudera CDH and trying to run reporting on data stored in Apache Hadoop. We send daily reports to clients, so we need to import data from the operational store into Hadoop every day.

Hadoop works in append-only mode, so we cannot run Hive UPDATE/DELETE queries. We can perform INSERT OVERWRITE on the dimension tables and append delta rows to the fact tables, but introducing thousands of delta rows every day does not seem like a great solution.
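The dimension/fact split described above can be sketched in HiveQL roughly as follows (table names, columns, and the partition value are hypothetical, for illustration only):

```sql
-- Dimension table: small enough to fully rebuild each day
INSERT OVERWRITE TABLE dim_customer
SELECT customer_id, name, city
FROM staging_customer;

-- Fact table: append only the day's delta rows into a dated partition
INSERT INTO TABLE fact_sales PARTITION (load_date = '2015-06-01')
SELECT sale_id, customer_id, amount
FROM staging_sales_delta;
```

Partitioning the fact table by load date at least keeps each day's delta isolated, so a bad load can be re-run by overwriting a single partition.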

Are there any other, more standard ways to update modified data in Hadoop?

Thanks


Solution

  • HDFS might be append-only, but Hive does support updates from version 0.14 onwards.

    See here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update

    A common design pattern is to take all of your previous and current data and insert it into a new table every time.

    Depending on your use case, have a look at Apache Impala, HBase, or even Apache Drill.
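A sketch of both approaches in HiveQL, with hypothetical table and column names. Note that Hive 0.14-era ACID tables must be bucketed, stored as ORC, and marked transactional, and the session must use the DbTxnManager:

```sql
-- Approach 1: native UPDATE/DELETE on an ACID table (Hive 0.14+)
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

CREATE TABLE customer_acid (
  id   INT,
  name STRING,
  city STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS   -- bucketing required for ACID tables
STORED AS ORC                      -- ORC required for ACID tables
TBLPROPERTIES ('transactional' = 'true');

UPDATE customer_acid SET city = 'Pune' WHERE id = 42;
DELETE FROM customer_acid WHERE id = 99;

-- Approach 2 (no ACID needed): rebuild a reconciled table from the
-- previous base plus the daily delta, keeping the latest version of
-- each row by its modification timestamp
INSERT OVERWRITE TABLE customer_reconciled
SELECT id, name, city
FROM (
  SELECT id, name, city,
         ROW_NUMBER() OVER (PARTITION BY id
                            ORDER BY modified_ts DESC) AS rn
  FROM (
    SELECT id, name, city, modified_ts FROM customer_base
    UNION ALL
    SELECT id, name, city, modified_ts FROM customer_delta
  ) unioned
) ranked
WHERE rn = 1;
```

The second approach is the "insert everything into a new table" pattern mentioned above: it avoids the ACID table restrictions at the cost of rewriting the full table on each load, which is usually acceptable for daily batch reporting.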