Search code examples
hadoopapache-piginner-join

Pig: field does not exist after join


It seems I'm somehow removing a key after join.

Context

The goal is to get all records from table_2 that consist only of distinct member ids from table_1.

As a novice, I'd appreciate pointers in best-practices to achieve this goal and any comments on why I get the "projected field" error.

What I've tried

-- assume %default vals set for path_1 and path_2 to data

-- load the data

table_1 = LOAD '$path_1' as (day, member_id, country);
table_2 = LOAD '$path_2' as (day, member_id, country);

-- get distinct member_id's from table_1

table_1_ids = DISTINCT(FOREACH table_1 GENERATE member_id as member_id);

-- get all records from table_2 that only have table_1_ids

new_table_2 = JOIN table_1_ids BY member_id, table_2 BY member_id;

the error

Invalid field projection. Projected field [member_id] does not exist in schema: table_1_ids::member_id:bytearray,table_2::day:bytearray, table_2::member_id:bytearray, table_2::country:bytearray.


Solution

  • Thanks to @piyush and @Nazar for the comments. One way to get the desired result is to i) uniquely label the keys to join on and ii) separate the distinct statement:

    table_1 = LOAD '$path_1' as (day, member_id_1, country); 
    table_2 = LOAD '$path_2' as (day, member_id_2, country);
    all_table_1_ids = FOREACH table_1 GENERATE member_id_1 as member_id_1;
    distinct_table_1_ids = DISTINCT all_table_1_ids;
    new_table_2 = JOIN distinct_table_1_ids BY member_id_1, table_2 BY member_id_2;