Search code examples
bigdataapache-pigapache-zeppelin

Comparing one record to another in another data set in Pig


I am a beginner to Pig and I am working with a large business data file. The business data file has cities from all over the world and I am looking to work with just the US cities. I have LOADED both data sets successfully and am trying to iterate over the total business data file and remove all cities that are not in the US cities dataset.

raw_bus_data = LOAD 'globalbus.json' USING com.twitter.elphantbird.pig.load.JsonLoader('-nestedLoad') as (row: map[]);

us_citystate = LOAD 'us.csv' USING PigStorage(',') AS (city:chararray, state:chararray);

I realize this next line doesn't work, but i'm trying to do something similar:

bus_us_data = FILTER raw_bus_data BY state == us_citystate state;

Is this something i would need to use FOREACH for?


Solution

  • You can do an inner join which will effectively filter out any rows that don't contain a US state. Because us_citystate has multiple rows for each state (one for each city), you will need to create a new relation B that contains only one row per state or else you will duplicate rows from bus_us_data. The code to do this and the inner join is:

    A = FOREACH us_citystate GENERATE state;
    B = DISTINCT A;
    C = JOIN raw_bus_data BY state, B BY state;
    bus_us_data = FOREACH C GENERATE 
        raw_bus_data::var1 AS var1, 
        raw_bus_data::var2 AS var2;
    

    Where var1, var2, etc. are the fields you wish to retain from raw_bus_data.