Tags: amazon-s3, hive, streaming, database-partitioning, hadoop-partitioning

Hive Date Partitioned table - Streaming Data in S3 with mixed dates


I have extensive experience working with Hive partitioned tables, and I use Hive 2.x. While interviewing for a Big Data Solution Architect role, I was asked the question below.

Question: How would you ingest streaming data into a Hive table partitioned on date? The streaming data is first stored in an S3 bucket and then loaded into Hive. Although the S3 bucket names have a date identifier such as S3_ingest_YYYYMMDD, the content could have data for more than one date.

My answer: Since the content could span more than one date, creating an external table directly over it might not work, because we need to read the files and distribute the rows based on date. I suggested we first load the S3 data into an external staging table with no partitions, and then load/insert into the final date-partitioned table using dynamic partition settings, which will dynamically route the data to the correct partition directories.
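
A minimal HiveQL sketch of what I proposed (the bucket path and the table and column names stg_events, events, event_id, payload are hypothetical):

    -- Staging: external, unpartitioned table over the day's S3 ingest folder.
    CREATE EXTERNAL TABLE stg_events (
        event_id   STRING,
        event_date DATE,
        payload    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/S3_ingest_20190101/';

    -- Final table, partitioned on the event date.
    CREATE TABLE events (
        event_id STRING,
        payload  STRING
    )
    PARTITIONED BY (event_date DATE)
    STORED AS ORC;

    -- Dynamic partitioning routes each row to its own date directory;
    -- the partition column must come last in the SELECT list.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT INTO TABLE events PARTITION (event_date)
    SELECT event_id, payload, event_date
    FROM stg_events;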

The interviewer said my answer was not correct. I was curious to know what the correct answer was, but we ran out of time.

The only caveat in my answer is that, over time, the partitioned date directories will accumulate many small files, leading to the classic small-files problem; that can always be handled via a batch maintenance (compaction) process, as sketched below.
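
For example, a typical maintenance job could compact a partition either in place or by rewriting it onto itself (assuming the hypothetical ORC table events from the sketch above; the partition value is illustrative):

    -- ORC/RCFile only: merge a partition's small files in place.
    ALTER TABLE events PARTITION (event_date='2019-01-01') CONCATENATE;

    -- Alternative for other formats: rewrite the partition onto itself,
    -- letting Hive's merge settings combine the output files.
    SET hive.merge.mapfiles=true;
    SET hive.merge.mapredfiles=true;

    INSERT OVERWRITE TABLE events PARTITION (event_date='2019-01-01')
    SELECT event_id, payload FROM events WHERE event_date='2019-01-01';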

What are the other/correct options to handle this scenario?

Thanks.


Solution

  • It depends on the requirements.

    As I understand it, if a file or folder named S3_ingest_YYYYMMDD can contain more than one date, then some events arrive a day late or even later. This is a rather common scenario.

    Ingestion date and event date are two different dates. Put the ingested files into a table partitioned by ingestion date (the landing zone, LZ). This lets you track the data as initially received, and if reprocessing is ever needed, ingestion_date serves as a bookmark for reprocessing the LZ table.

    Then schedule a process that takes the last two or more days of ingestion dates and loads them into a table partitioned by event_date. The last day will always be incomplete, and you may need to widen the look-back period to three or more ingestion days (using an ingestion_date >= current_date - 2 filter); it depends on how many days back ingestion can deliver event dates. In this process you use dynamic partitioning by event_date, apply whatever logic is needed (cleaning, etc.), and load into the ODS or DM; see the sketch after this list.

    This approach is very similar to what you proposed. The difference is in the first table: it should be partitioned so that you can process data incrementally and easily restate it if you need to change the logic, or if the upstream data was itself restated and reloaded into the LZ.
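
A hedged HiveQL sketch of that layout (all names and paths hypothetical: lz_events, ods_events, event_id, payload): an LZ table partitioned by ingestion_date, plus a scheduled reload that re-reads the last few ingestion days and rewrites the affected event_date partitions via dynamic partitioning. It assumes the look-back window covers the full lateness horizon; otherwise the overwrite could drop late rows loaded in an earlier run.

    -- LZ: one partition per S3 ingest folder, keyed on ingestion date.
    CREATE EXTERNAL TABLE lz_events (
        event_id   STRING,
        event_date DATE,
        payload    STRING
    )
    PARTITIONED BY (ingestion_date DATE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Register each day's folder as it lands (MSCK REPAIR TABLE also works
    -- if the folders follow ingestion_date=YYYY-MM-DD naming).
    ALTER TABLE lz_events ADD IF NOT EXISTS
      PARTITION (ingestion_date='2019-01-02')
      LOCATION 's3://my-bucket/S3_ingest_20190102/';

    -- ODS: partitioned by event date.
    CREATE TABLE ods_events (
        event_id STRING,
        payload  STRING
    )
    PARTITIONED BY (event_date DATE)
    STORED AS ORC;

    -- Scheduled job: reprocess the look-back window. With dynamic
    -- partitioning, INSERT OVERWRITE replaces only the event_date
    -- partitions that actually receive rows from this SELECT.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE ods_events PARTITION (event_date)
    SELECT event_id, payload, event_date
    FROM lz_events
    WHERE ingestion_date >= date_sub(current_date, 2);  -- 3-day look-back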