I am unable to generate SUM(km) grouped by trip_id with the code below:
twa = LOAD 'hdfs://localhost:54310/sImport_20170508100625/t_waypoint_actual.txt' USING PigStorage('|') as
(id:int, trip_id:chararray, address_id:int, timestamp_utc:chararray, driver_id:int, ETA:chararray,event_id:int, imei_number:chararray, vehicle_imei_id:int,
km:double, avg_speed:double, duration:chararray, signal_strength:float, battery_strength:float, event_type:chararray);
twa_group = GROUP twa BY (id,trip_id,km);
twa_foreach = FOREACH twa_group GENERATE FLATTEN(group), twa.trip_id AS trip_id, (SUM(twa.km)) AS km;
twa_filter = FILTER twa_foreach BY (trip_id == '466');
DUMP twa_filter;
Error:
In alias twa_filter, incompatible types in Equal Operator left hand side:bag :tuple(trip_id:chararray) right hand side:chararray
I tried several methods but got no output. Can anyone suggest the correct solution? Thanks in advance.
Input:
id,trip_id,km
1,466,1.4
2,466,2.3
Expected Output:
trip_id,km
466,3.7
When you select a column from grouped data, the result is always a bag, but since you are grouping by this column you can simply select it from the group key.
twa_foreach = FOREACH twa_group GENERATE group.id AS id, group.km AS km,
    group.trip_id AS trip_id, SUM(twa.km) AS total_km;
twa_filter = FILTER twa_foreach BY (trip_id == '466');
In case you need a column that's not in the group key, you'll need to use a nested LIMIT 1 plus FLATTEN, as shown below.
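A minimal sketch of that pattern, reusing your twa relation and the event_type column from your schema (untested, just to illustrate the idea):
twa_group   = GROUP twa BY trip_id;
twa_foreach = FOREACH twa_group {
    -- keep a single tuple from the bag so the non-key column can become a scalar
    one_row = LIMIT twa 1;
    GENERATE group AS trip_id,
             FLATTEN(one_row.event_type) AS event_type,
             SUM(twa.km) AS km;
};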
Okay, I inspected your code a bit. It looks like you want to obtain the sum of km for each trip_id. Assuming the input looks like this (cat testdata/7.csv):
1|456|2.5|somedata1
2|466|2.7|somedata2
2|466|2.7|somedata2
4|456|2.8|somedata3
4|456|2.9|somedata4
4|456|2.9|somedata4
5|466|2.5|somedata5
5|466|2.5|somedata5
And the Pig script:
twa = LOAD 'testdata/7.csv' USING PigStorage('|') as
(id:int, trip_id:chararray, km:double, event_type:chararray);
twa_group = GROUP twa BY (trip_id);
twa_foreach = FOREACH twa_group GENERATE group AS trip_id, (SUM(twa.km)) AS km;
twa_filter = FILTER twa_foreach BY (trip_id == '466');
DUMP twa_filter;
The result is
(466,10.4)
If this doesn't work for you, you are doing something wrong. Also consider doing the filtering before grouping, as the GROUP operation is really costly.
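For instance, with the same relations as above, pushing the filter above the group would look roughly like this (just a sketch):
twa_466   = FILTER twa BY trip_id == '466';    -- keep only the rows you care about
twa_group = GROUP twa_466 BY trip_id;          -- group the already-filtered data
twa_sum   = FOREACH twa_group GENERATE group AS trip_id, SUM(twa_466.km) AS km;
DUMP twa_sum;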