Search code examples
hadoophiveapache-pigcdr

Count and find maximum number in Hadoop using pig


I have a table which contain sample CDR data in that column A and column B having calling person and called person mobile number I need to find whose having maximum number of calls made(column A) and also need to find to which number(column B) called most

the table structure is like below

calling called

  • 889578226 77382596
  • 889582256 77382596
  • 889582256 7736368296
  • 7785978214 782987522

in the above table 889578226 have most number of outgoing calls and 77382596 is most called number in such a way need to get the output

in hive i run like below

SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;

what might be the equalent code for the above query in pig?


Solution

  • Anas, Could you please let me know this is what you are expecting or something different?

    input.txt
    a,100
    a,101
    a,101
    a,101
    a,103
    b,200
    b,201
    b,201
    c,300
    c,300
    c,301
    d,400
    
    PigScript:
    A = LOAD 'input.txt' USINg PigStorage(',') AS (name:chararray,phone:long);
    B = GROUP A BY (name,phone);
    C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
    D = GROUP C BY $0;
    E = FOREACH D {
                    SortedList = ORDER C BY cnt DESC;
                    top = LIMIT SortedList 1;
                    GENERATE FLATTEN(top);
                  }
    DUMP E;
    
    Output:
    (a,101,3)
    (b,201,2)
    (c,300,2)
    (d,400,1)