Search code examples
apache-pig

FLATTEN or SUBSTRING messes with JOIN in Pig


I have two datasets, data1 and data2.

data2 has following data,

a1:u:11#eve:f:6
a1:u:12#eve:f:6
a1:u2:13#eve:f:3
a1:u1:12#eve:s:6
a1:u1:11#eve:f:6

Here : as well as # are delimiters. I generate data2 finally as,

LOAD '$data2' USING PigStorage(':') AS
                 (ad: chararray,
                  a_id: chararray,
                  cid_eve1: chararray,
                  name: chararray,
                  len: int);

Then I separate 3rd column into two,

FOREACH data2 GENERATE
                  ad AS ad,
                  a_id AS a_id,
                  FLATTEN(STRSPLIT(cid_eve1, '#')) AS (cid: int, eve1: chararray),
                  name AS name,
                  len AS len;

Now, when I join data2 with data1, I get nothing.

I have also tried,

FOREACH data2 GENERATE
                  ad AS ad,
                  a_id AS a_id,
                  SUBSTRING(cid_eve1,0,INDEXOF(cid_eve1,'#',0)) AS cid: int,
                  name AS name,
                  len AS len;

This also returns nothing when joined. I am joining on 3rd column, cid.

I have even dumped data2 for both cases and seen the output. It is what is expected. But when I use following file as data2,

a1:u:11:eve:f:6
a1:u:12:eve:f:6
a1:u2:13:eve:f:3
a1:u1:12:eve:s:6
a1:u1:11:eve:f:6

and load as,

LOAD '$data2' USING PigStorage(':') AS
             (ad: chararray,
              a_id: chararray,
              cid: int,
              eve1: chararray,
              name: chararray,
              len: int);

Then the join returns correct results. I have no idea why this is happening. Can someone please help or give any suggestions.

data1, 2nd col($1) is a_id and last col is cid. Join is on both of them.

1,u,true,true,4,1,1,1,1,1,11,21,31,11
1,u,true,true,4,1,1,1,1,1,11,21,32,11
1,u,true,true,4,1,1,1,1,1,11,21,33,11
1,u,true,true,4,1,1,1,1,1,11,21,31,11
1,u,true,true,4,1,1,1,1,1,11,21,32,11
1,u,true,true,4,1,1,1,1,1,11,21,33,11
2,u,true,true,4,1,1,1,1,1,12,22,34,12
2,u,true,true,4,1,1,1,1,1,13,22,35,13
2,u1,true,false,4,1,1,1,1,0,12,22,34,12
2,u1,true,false,4,1,1,1,1,0,13,22,35,13
2,u1,true,true,9,1,1,1,1,1,12,22,34,12
2,u1,true,true,9,1,1,1,1,1,13,22,35,13
3,u,false,false,4,1,0,1,0,0,14,24,31,14
3,u,false,false,4,1,0,1,0,0,11,22,31,11
4,u,true,NULL,0,1,1,0,0,0,11,22,33,11
4,u1,false,NULL,0,1,0,0,0,0,11,22,33,11
2,u,true,true,4,1,1,1,1,1,12,22,34,12
2,u,true,true,4,1,1,1,1,1,13,22,35,13
2,u2,true,true,7,1,1,1,1,1,12,22,34,12
2,u2,true,true,7,1,1,1,1,1,13,22,35,13

Solution

  • I have found the answer. The problem lies in datatypes. I was trying to read chararray into int but without typecasting it.

    When I changed it to,

    FOREACH data2 GENERATE
                  ad AS ad,
                  a_id AS a_id,
                  (int)SUBSTRING(cid_eve1,0,INDEXOF(cid_eve1,'#',0)) AS cid,
                  name AS name,
                  len AS len;
    

    It worked.