Search code examples
apache-pig

incompatible types in Equal Operator left hand side:bag :tuple(trip_id:chararray) right hand side:chararray


I am unable to generate SUM(km) using trip_id from below code:

twa = LOAD 'hdfs://localhost:54310/sImport_20170508100625/t_waypoint_actual.txt' USING  PigStorage('|') as
(id:int, trip_id:chararray, address_id:int, timestamp_utc:chararray, driver_id:int, ETA:chararray,event_id:int, imei_number:chararray, vehicle_imei_id:int,
km:double, avg_speed:double, duration:chararray, signal_strength:float, battery_strength:float, event_type:chararray);

twa_group = GROUP twa BY (id,trip_id,km);
twa_foreach = FOREACH twa_group GENERATE FLATTEN(group), twa.trip_id AS trip_id, (SUM(twa.km)) AS km;
twa_filter = FILTER twa_foreach BY (trip_id == '466');

DUMP twa_filter;

Error:

In alias twa_filter, incompatible types in Equal Operator left hand side:bag :tuple(trip_id:chararray)  right hand side:chararray

I tried several methods but no output. Can anyone suggest me the correct solution. Thanks in advance.

Input:
id,trip_id,km
1,466,1.4
2,466,2.3


 Expected Output:
    trip_id,km
     466,3.7

Solution

  • When you select column from grouped data the result is always a bag, but as you are grouping by this column you can just select it from group key.

    twa_foreach = FOREACH twa_group GENERATE group.id as id, groum.km as km, 
    group.trip_id AS trip_id, (SUM(twa.km)) AS km;
    twa_filter = FILTER twa_foreach BY (trip_id == '466');
    

    In case you'll need to use column that's not in the key, you'll need to use limit 1 + flatten.

    okay, inspected your code a bit. Looks like you want to obtain sum of km for each id, trip_id pair. Assuming it cat testdata/7.csv:

    1|456|2.5|somedata1
    2|466|2.7|somedata2
    2|466|2.7|somedata2
    4|456|2.8|somedata3
    4|456|2.9|somedata4
    4|456|2.9|somedata4
    5|466|2.5|somedata5
    5|466|2.5|somedata5
    

    And the pig script

    twa = LOAD 'testdata/7.csv' USING  PigStorage('|') as
    (id:int, trip_id:chararray, km:double, event_type:chararray);
    
    twa_group = GROUP twa BY (trip_id);
    twa_foreach = FOREACH twa_group GENERATE group AS trip_id, (SUM(twa.km)) AS km;
    twa_filter = FILTER twa_foreach BY (trip_id == '466');
    DUMP twa_filter;
    

    The result is

    (466,10.4)
    

    If this doesn't work for you - you are doing smth wrong. Also consider doing filtering before grouping as `group operation is really costly