How to join two relations in Pig on multiple fields


I have two CSV files:

1. Fertility.csv:

[image: contents of Fertility.csv]

2. Life Expectency.csv:

[image: contents of Life Expectency.csv]

I want to join them in Pig so that the result looks like this:

[image: expected joined result]

I am new to Pig and couldn't get the correct result. Here is my code:

    fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
    lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();

    A = JOIN fertility by country, lifeExpectency by country;
    B = JOIN fertility by year, lifeExpectency by year;
    C = UNION A,B;
    DUMP C;

Here is the result of my code:

[image: output of the code above]


Solution

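  • You have to join on both country and year in a single JOIN, then select the columns needed for your final output. Joining on country alone pairs every year of one relation with every year of the other for that country, and the UNION only concatenates the two mismatched joins.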

    -- Load both tables through HCatalog
    fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
    lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();

    -- Join on the composite key (country, year)
    A = JOIN fertility BY (country, year), lifeExpectency BY (country, year);
    -- Keep one copy of the key columns plus the two measures
    B = FOREACH A GENERATE fertility::country, fertility::year, fertility::fertility, lifeExpectency::lifeExpectency;
    DUMP B;
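
If the data is read straight from the CSV files rather than from HCatalog tables, the same composite-key join might look like the sketch below. The file paths, the column order, the field types, and the absence of a header row are all assumptions made for illustration; adjust them to match the actual files.

```pig
-- Assumed layout: country,year,value with no header row (hypothetical paths).
fertility = LOAD 'fertility.csv' USING PigStorage(',')
            AS (country:chararray, year:int, fertility:double);
lifeExpectency = LOAD 'lifeExpectency.csv' USING PigStorage(',')
                 AS (country:chararray, year:int, lifeExpectency:double);

-- Inner join on the composite key: a row is kept only when BOTH
-- country and year match in the two relations.
joined = JOIN fertility BY (country, year), lifeExpectency BY (country, year);

-- After a join, same-named fields must be disambiguated with relation::field.
result = FOREACH joined GENERATE
             fertility::country AS country,
             fertility::year AS year,
             fertility::fertility AS fertility,
             lifeExpectency::lifeExpectency AS lifeExpectency;

DUMP result;
```

Because this is an inner join, any (country, year) pair present in only one file is dropped from the result. If those rows should be kept, an outer join (for example `JOIN ... BY (country, year) LEFT OUTER, ...`) is the usual alternative.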