hadoop hdfs bigdata apache-pig

Pig Latin JOIN error


I am loading two datasets, A and B:

A = LOAD [datapath];
B = LOAD [datapath];

I want to JOIN all fields of A and B on the id field. Both A and B have the common field id plus other fields. I perform the JOIN by id like this:

AB = JOIN A BY id, B BY id;

The resulting dataset AB includes two identical columns for the id field, but it should show only one. What am I doing wrong here?


Solution

  • That's the expected behaviour: when you join two datasets, all columns from both are included, even the ones you are joining by.

    You can check it here

    If you want to drop a column, you can do it with a FOREACH ... GENERATE statement. But first you need to know the position of the undesired column.

    If that column is, for instance, in the third position (that is, $2, since Pig positions start at $0):

    C = FOREACH AB GENERATE $0, $1, $3, $4...;
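
    When there are many columns, listing every position gets tedious. As a rough sketch, assuming a Pig version that supports project-range expressions (0.9 or later, as far as I know) and assuming the unwanted column sits at $2, the same projection could be written as:

    -- Keep $0 through $1, skip $2, then keep everything from $3 to the end.
    -- The position $2 is an assumption for illustration; adjust it to your schema.
    C = FOREACH AB GENERATE $0 .. $1, $3 ..;

    Running DESCRIBE AB; first is an easy way to confirm which position the duplicate column actually has.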
    

    Edit (from the comments): you can also write the GENERATE statement without knowing the positions, by referring to each field through its relation prefix. Example:

    C = FOREACH AB GENERATE A::id AS id, A::foo AS foo, B::bar AS bar;
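
    Putting it together, here is a minimal end-to-end sketch. The file names, the comma delimiter and the schemas are hypothetical (the question does not give them); only the field names id, foo and bar come from the example above.

    -- Hypothetical inputs: paths, delimiter and schemas are assumptions.
    A = LOAD 'a.csv' USING PigStorage(',') AS (id:int, foo:chararray);
    B = LOAD 'b.csv' USING PigStorage(',') AS (id:int, bar:chararray);

    -- The join keeps every column of both relations, so id appears twice.
    AB = JOIN A BY id, B BY id;
    DESCRIBE AB;  -- the schema should list both A::id and B::id

    -- Keep a single id column and give the fields plain names again.
    C = FOREACH AB GENERATE A::id AS id, A::foo AS foo, B::bar AS bar;
    DESCRIBE C;

    The AS clauses matter here: without them the fields in C would keep their A:: and B:: prefixes.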