I am new to Hadoop programming and am looking for help with Pig. I have data coming in simple .txt format with , as the delimiter. I have two use cases: I want to do ltrim(rtrim()) on all the columns, and convert selected fields to UPPER case.
Here is my script:
party = Load '/party_test_pig.txt' USING PigStorage(',') AS(....);
Trim_party = FOREACH Upper_party GENERATE TRIM(*);
Upper_party = FOREACH party GENERATE UPPER(col1), UPPER(col2), UPPER(col3);
Upper_party:
After converting to upper case, I want to view all the columns, not only the columns that were changed to upper case.
Trim_party:
I did some research and found that, to trim all columns, I would have to write a UDF. I could write Trim_party = FOREACH Upper_party GENERATE TRIM(col1)...TRIM(coln);
but I feel this is inefficient and time-consuming.
Is there any other way to make this script work without writing a UDF for trim?
Thanks in advance.
It would be easier if you gave a sample of your data. From what I understand, I would do it this way:
-- Load each line as one string with TextLoader
A = LOAD '/user/guest/Pig/20151112.PigTest.txt' USING TextLoader() AS (line:CHARARRAY);
-- Apply the UPPER transformation to the whole line (every column becomes upper case); spaces inside your strings are kept
B = FOREACH A GENERATE UPPER(line) AS lineUP;
-- Split lines with your delimiter
C = FOREACH B GENERATE FLATTEN(STRSPLIT(lineUP, ',')) AS (col1:CHARARRAY, ... ,coln:CHARARRAY);
-- Trim leading and trailing whitespace from each column
D = FOREACH C GENERATE TRIM(col1) AS col1T, ..., TRIM(coln) AS colnT;
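
If only some fields should be upper-cased, note that the script above upper-cases the whole line. A minimal sketch of an alternative is below, assuming a hypothetical four-column schema (col1 to col4 are placeholder names, since the real schema was not posted) and that only col1 and col2 need UPPER. It keeps PigStorage, lists every column in one FOREACH so all of them stay in the output, and nests UPPER around TRIM where both are wanted:

```pig
-- Hypothetical schema: replace col1..col4 with your real field names
party = LOAD '/party_test_pig.txt' USING PigStorage(',')
        AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray);

-- Every column is listed, so every column appears in the result.
-- col1 and col2 are trimmed and upper-cased; col3 and col4 are only trimmed.
clean_party = FOREACH party GENERATE
        UPPER(TRIM(col1)) AS col1,
        UPPER(TRIM(col2)) AS col2,
        TRIM(col3)        AS col3,
        TRIM(col4)        AS col4;

DUMP clean_party;
```

This still means writing TRIM once per column, but it avoids a UDF and does both transformations in a single pass over the data.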