Search code examples
apache-pig

Group and count with null group key


I have a Pig script that generates some statistics for page views. Here's a snippet:

pageview_content_age_tmp = 
  FOREACH pageviews GENERATE ((1398816000000L-created)/86400000L)/7L as age;
pageview_content_age = 
  FOREACH pageview_content_age_tmp GENERATE (age>6L?6L:age) as age;
content_created_hist = 
  FOREACH (group pageview_content_age by age) GENERATE group as age, 
  COUNT(pageview_content_age) as count;
content_created_hist_sorted = ORDER content_created_hist by age;
dump content_created_hist_sorted;

This generates a histogram of the age of pages viewed (created is the timestamp when the content was created), e.g.,

(,0)
(0,8873)
(1,6701)
(2,9868)
(3,6466)
(4,12593)
(5,4971)
(6,51520)

The first row (,0) corresponds to the null key. However, it turns out that the result is wrong, i.e., the number of page views associated with the null key is much higher than 0 (that's caused by missing data, which is fine).

According to the documentation for version 0.10.1, the relation group pageview_content_age by age should group together all page view records corresponding to the null key together since it's grouping a single relation, but something is breaking along the way. I wonder if it's the previous relation that squashes all data with age > 7 together that's messing things up. Any clue?


Solution

  • COUNT ignores null values. You are looking for COUNT_STAR.