join apache-pig dump

Error after joining objects in Apache Pig


I have two data objects in Pig.

data_1:

col_a: chararray,
col_b: int,
col_c: int,
col_d: chararray

data_2:

col_a: chararray,
col_b: chararray,
col_c: int,
col_d: int,
col_e: int

I want to join the two of them. I tried both of the following:

all_data = JOIN data_1 BY (col_a) LEFT, data_2 BY (col_b);
all_data = JOIN data_1 BY (col_a), data_2 BY (col_b);

When I tried to dump the joined object (after limiting it to 10 records), both options gave back the same error:

Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: all_data_limit: Limit - scope-6383 Operator Key: scope-6383): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: all_data: New For Each(true,true)[tuple] - scope-6382 Operator Key: scope-6382): org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.lang.ClassCastException: org.apache.pig.impl.io.NullableText cannot be cast to org.apache.pig.impl.io.NullableBytesWritable
  • DESCRIBE for both objects (data_1, data_2) gave back the expected output (the schemas I wrote at the top).
  • DESCRIBE for the joined object, all_data, also gave back the expected output.
  • I dumped both objects with LIMIT 10, and they contain good data.
  • I'm using an Amazon EMR cluster, release "emr-5.2.0", with Pig version 0.16.0.

I'm getting a bit frustrated: I have been searching for a solution for 3 days now and couldn't find one. Any help would be great. Thanks!


Solution

  • Use the commands below, which apply TRIM to both join keys:

    all_data = JOIN data_1 BY TRIM(col_a) LEFT, data_2 BY TRIM(col_b);
    all_data = JOIN data_1 BY TRIM(col_a), data_2 BY TRIM(col_b);
    

    Let me know whether this works without an error.
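
For context, the class names in the exception hint at a key type mismatch: in Pig, NullableText wraps chararray keys and NullableBytesWritable wraps bytearray keys. TRIM returns a chararray, so wrapping both keys in it makes both join expressions chararray-typed, which is presumably why the suggestion helps. Below is a minimal end-to-end sketch of the left-join variant, followed by the limit and dump steps described in the question. The relation name all_data_limit is taken from the error message; the sketch has not been run against the asker's data.

```pig
-- Join on TRIM'd keys; TRIM returns a chararray, so both join keys
-- are chararray expressions on either side of the join.
all_data = JOIN data_1 BY TRIM(col_a) LEFT, data_2 BY TRIM(col_b);

-- Limit to 10 records before dumping, as described in the question.
all_data_limit = LIMIT all_data 10;
DUMP all_data_limit;
```

Note that TRIM also strips leading and trailing whitespace, so keys that differ only in surrounding whitespace will now match. If that is not wanted, check whether the change in matched rows is acceptable before adopting this workaround.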