I am a beginner to Pig and I am working with a large business data file. The business data file has cities from all over the world and I am looking to work with just the US cities. I have LOADED
both data sets successfully and am trying to iterate over the total business data file and remove all cities that are not in the US cities dataset.
raw_bus_data = LOAD 'globalbus.json' USING com.twitter.elphantbird.pig.load.JsonLoader('-nestedLoad') as (row: map[]);
us_citystate = LOAD 'us.csv' USING PigStorage(',') AS (city:chararray, state:chararray);
I realize this next line doesn't work, but i'm trying to do something similar:
bus_us_data = FILTER raw_bus_data BY state == us_citystate state;
Is this something i would need to use FOREACH
for?
You can do an inner join which will effectively filter out any rows that don't contain a US state. Because us_citystate
has multiple rows for each state (one for each city), you will need to create a new relation B
that contains only one row per state or else you will duplicate rows from bus_us_data
. The code to do this and the inner join is:
A = FOREACH us_citystate GENERATE state;
B = DISTINCT A;
C = JOIN raw_bus_data BY state, B BY state;
bus_us_data = FOREACH C GENERATE
raw_bus_data::var1 AS var1,
raw_bus_data::var2 AS var2;
Where var1, var2, etc. are the fields you wish to retain from raw_bus_data
.