Search code examples
hadoopapache-pig

Pig - Mapping and retrieving two columns against master table?


I'm experimenting with pig on the openflights datasets (https://openflights.org/data.html). I'm currently trying to map a query that contains all the unique possible flight routes, i.e. the table below

+---------------+-------------+
| Start_Airport | End_Airport |
+---------------+-------------+
| YYZ           | NYC         |
| YBG           | YVR         |
| AEY           | GOH         |
+---------------+-------------+ 

And then join both values against a master table which contains the longitude and latitude of each airport. i.e.

+---------+----------+-----------+
| Airport | Latitude | Longitude |
+---------+----------+-----------+
| YYZ     |    -10.3 |      1.23 |
| YBG     |    -40.3 |      50.4 |
| AEY     |     30.3 |      30.3 |
+---------+----------+-----------+

How would I go about trying to do this? I am essentially trying to have a final table which looks like

+----------------+----------+-----------+-------------+----------+-----------+
| Start_Airport  | Latitude | Longitude | End_Airport | Latitude | Longitude |
+----------------+----------+-----------+-------------+----------+-----------+
| YYZ            |    -10.3 |      1.23 | NYC         | blah     | blah      |
| YBG            |    -40.3 |      50.4 | YVR         | blah     | blah      |
| AEY            |     30.3 |      30.3 | GOH         | blah     | blah      |
+----------------+----------+-----------+-------------+----------+-----------+

I'm currently trying to do as follows, with c being the first table

route_data = JOIN c by (start_airport, end_airport), airports_all by ($0, $0);

I'm thinking this essentially says for the query, join the starting_aiport and the ending_airport against the respective code and then pull through the respective longitude and latitude,


Solution

  • route_data = JOIN c by (start_airport, end_airport), airports_all by ($0, $0);

    This is similar to "and" conditions clause of a typical join query in sql world. Imagine below query. Will it yield your desired results. select * from c t1 join airports_all t2 on a.start_airport=b.first_field and a.end_airport=b.first_field; This will bring results only when both start_airport and end_airport are same.

    What you desire can be achieved in below way:

    cat > routes.txt
    YYZ,NYC
    YBG,YVR
    AEY,GOH
    
    cat > airports_all.txt
    YYZ,-10.3,1.23
    YBG,-40.3,50.4
    AEY,30.3,30.3
    

    Pig Code:

    tab1 = load '/home/ec2-user/routes.txt' using PigStorage(',') as (start_airport,end_airport);
    describe tab1
    tab2 = load '/home/ec2-user/airports_all.txt' using PigStorage(',') as (Airport,Latitude,Longitude);
    describe tab2
    tab3 = JOIN tab1 by (start_airport), tab2 by (Airport);
    describe tab3
    tab4 = foreach tab3 generate $0 as start_airport, $3 as start_Latitude, $4 as start_Longitude, $1 as end_airport;
    describe tab4
    tab5 = JOIN tab4 by (end_airport), tab2 by (Airport);
    describe tab5
    tab6 = foreach tab5 generate $0 as start_airport, $1 as start_Latitude, $2 as start_Longitude, $3 as end_airport, $5 as end_Latitude, $6 as end_Longitude;
    describe tab6
    dump tab6