I have a Pig script that loads the "company" section of a JSON file. When I group and count, the count is 0 if domain is missing (or null) in the file. How can I have a missing domain grouped as an empty string and still counted?
Example of file:
{"company": {"domain": "test1.com", "name": "test1 company"}}
{"company": {"domain": "test1.com", "name": "test1 company"}}
{"company": {"domain": "test1.com", "name": "test2 company"}}
{"company": {"domain": "test2.com", "name": "test2 company"}}
{"company": {"domain": "test2.com", "name": "test3 company"}}
{"company": {"domain": "test3.com", "name": "test3 company"}}
{"company": {"domain": "test3.com", "name": "test3 company"}}
{"company": {"name": "test4 company"}}
{"company": {"name": "test4 company"}}
Expected results:
"test1.com", "test1 company", 2
"test1.com", "test2 company", 1
"test2.com", "test2 company", 1
"test2.com", "test3 company", 1
"test3.com", "test3 company", 2
"", "test4 company", 2
Actual results:
"test1.com", "test1 company", 2
"test1.com", "test2 company", 1
"test2.com", "test2 company", 1
"test2.com", "test3 company", 1
"test3.com", "test3 company", 2
, "test4 company", 0
Current Pig script:
data = LOAD 'myfile' USING org.apache.pig.piggybank.storage.JsonLoader('company: (domain:chararray, name:chararray)');
filtered = FILTER data BY (company is not null);
events = FOREACH filtered GENERATE FLATTEN(company) as (domain, name);
grouped = GROUP events BY (domain, name);
counts = FOREACH grouped GENERATE group as domain, COUNT(events) as count;
ordered = ORDER counts by count DESC;
Thanks for the help!
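Instead of COUNT, try COUNT_STAR. COUNT skips any tuple whose first field is null, and here the first field is the missing domain, so those rows are counted as 0. COUNT_STAR counts every tuple in the bag, nulls included: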
counts = FOREACH grouped GENERATE group as domain, COUNT_STAR(events) as count;
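COUNT_STAR fixes the count, but the domain still comes out as null rather than "". To get the empty string as well, one option is to replace the null before grouping. The following is a minimal sketch based on the script in the question, not a tested drop-in: the `cleaned` alias is a name made up for this example, and `FLATTEN(group)` is an added change so that domain and name come out as separate columns.

```pig
data = LOAD 'myfile' USING org.apache.pig.piggybank.storage.JsonLoader('company: (domain:chararray, name:chararray)');
filtered = FILTER data BY (company is not null);
events = FOREACH filtered GENERATE FLATTEN(company) as (domain, name);

-- Hypothetical extra step: map a null domain to '' so it groups as an empty string.
cleaned = FOREACH events GENERATE (domain is null ? '' : domain) as domain, name;

grouped = GROUP cleaned BY (domain, name);

-- FLATTEN(group) splits the (domain, name) group key back into two columns.
-- COUNT_STAR counts every tuple in the bag, including ones with null fields.
counts = FOREACH grouped GENERATE FLATTEN(group) as (domain, name), COUNT_STAR(cleaned) as count;
ordered = ORDER counts by count DESC;
```

Once the null domain has been replaced with '', the first field of each tuple is no longer null, so plain COUNT should give the same totals here. COUNT_STAR is kept as the safer choice in case other fields can also be null.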