Search code examples
sqlhadoophiveimpala

How to used daily data to construct yearly data?


I have a set of data that is structured as such:

[uid, product,   currency,  platform,  date]
[100, product_1, USA,       desktop,   2019-01-01]
[100, product_2, USA,       desktop,   2019-01-03]
[200, product_3, CAN,       mobile,    2019-01-02]
[300, product_1, GBP,       desktop,   2019-01-01]
and so on...

The data must be aggregated yearly:

[year, product,   currency, platform,  uid_count]
[2019, product_1, USA,      desktop,   1000]
[2019, product_2, USA,      desktop,   2000]
[2019, product_3, GBP,      mobile,    5000]

After researching a solution I read about sketching algorithms, which seem to be in the right direction. Essentially, the data is too large to load in one batch, so I need to process it incrementally, per day for example, so that I am not running a SQL query like:

SELECT year(date), product, currency, platform, count(distinct uid) FROM tbl_name GROUP BY 1, 2, 3, 4

OR

SELECT year(date), product, currency, platform, count(distinct uid) FROM tbl_name GROUP BY 1, 2, 3, 4
with cube

Solution

  • Unfortunately, count(distinct uid) is not additive and you need to re-iterate the whole year dataset, you cannot count one day distinct and add it to the existing cumulative year count distinct. Because if the same UIDs exist in many different days then count(distinct uid) on day one + count(distinct uid) on day two does not equal to count(distinct uid) calculated on these two days. This makes count(distinct) not scalable.

    But probably you can do some close estimation based on sketching algorithm if the estimation is applicable.

    There are few implementations of sketch algorithms for Hive ready for use.

    1. This HyperLogLog for Hive: HllHiveUDFs Sketches library from Yahoo

    2. Brickhouse sketch UDFs - "K-minimum values" sketching algorithm.

    3. One more implementation: https://github.com/MLnick/hive-udf/wiki