I have two datasets, data1
and data2
.
data2
has following data,
a1:u:11#eve:f:6
a1:u:12#eve:f:6
a1:u2:13#eve:f:3
a1:u1:12#eve:s:6
a1:u1:11#eve:f:6
Here :
as well as #
are delimiters. I generate data2
finally as,
LOAD '$data2' USING PigStorage(':') AS
(ad: chararray,
a_id: chararray,
cid_eve1: chararray,
name: chararray,
len: int);
Then I separate 3rd column into two,
FOREACH data2 GENERATE
ad AS ad,
a_id AS a_id,
FLATTEN(STRSPLIT(cid_eve1, '#')) AS (cid: int, eve1: chararray),
name AS name,
len AS len;
Now, when I join data2
with data1
, I get nothing.
I have also tried,
FOREACH data2 GENERATE
ad AS ad,
a_id AS a_id,
SUBSTRING(cid_eve1,0,INDEXOF(cid_eve1,'#',0)) AS cid: int,
name AS name,
len AS len;
This also returns nothing when joined. I am joining on 3rd column, cid
.
I have even dumped data2
for both cases and seen the output. It is what is expected. But when I use following file as data2
,
a1:u:11:eve:f:6
a1:u:12:eve:f:6
a1:u2:13:eve:f:3
a1:u1:12:eve:s:6
a1:u1:11:eve:f:6
and load as,
LOAD '$data2' USING PigStorage(':') AS
(ad: chararray,
a_id: chararray,
cid: int,
eve1: chararray,
name: chararray,
len: int);
Then the join returns correct results. I have no idea why this is happening. Can someone please help or give any suggestions.
data1
, 2nd col($1
) is a_id
and last col is cid
. Join is on both of them.
1,u,true,true,4,1,1,1,1,1,11,21,31,11
1,u,true,true,4,1,1,1,1,1,11,21,32,11
1,u,true,true,4,1,1,1,1,1,11,21,33,11
1,u,true,true,4,1,1,1,1,1,11,21,31,11
1,u,true,true,4,1,1,1,1,1,11,21,32,11
1,u,true,true,4,1,1,1,1,1,11,21,33,11
2,u,true,true,4,1,1,1,1,1,12,22,34,12
2,u,true,true,4,1,1,1,1,1,13,22,35,13
2,u1,true,false,4,1,1,1,1,0,12,22,34,12
2,u1,true,false,4,1,1,1,1,0,13,22,35,13
2,u1,true,true,9,1,1,1,1,1,12,22,34,12
2,u1,true,true,9,1,1,1,1,1,13,22,35,13
3,u,false,false,4,1,0,1,0,0,14,24,31,14
3,u,false,false,4,1,0,1,0,0,11,22,31,11
4,u,true,NULL,0,1,1,0,0,0,11,22,33,11
4,u1,false,NULL,0,1,0,0,0,0,11,22,33,11
2,u,true,true,4,1,1,1,1,1,12,22,34,12
2,u,true,true,4,1,1,1,1,1,13,22,35,13
2,u2,true,true,7,1,1,1,1,1,12,22,34,12
2,u2,true,true,7,1,1,1,1,1,13,22,35,13
I have found the answer. The problem lies in datatypes. I was trying to read chararray
into int
but without typecasting it.
When I changed it to,
FOREACH data2 GENERATE
ad AS ad,
a_id AS a_id,
(int)SUBSTRING(cid_eve1,0,INDEXOF(cid_eve1,'#',0)) AS cid,
name AS name,
len AS len;
It worked.