PIG TRIM and UPPER


I am new to Hadoop programming and am looking for help with Pig. My data comes from a simple .txt file with , as the delimiter. I have two use cases: I want to apply the equivalent of ltrim(rtrim()) to all the columns, and convert selected fields to upper case.

Here is my script:

party = LOAD '/party_test_pig.txt' USING PigStorage(',') AS (....);
Upper_party = FOREACH party GENERATE UPPER(col1), UPPER(col2), UPPER(col3);
Trim_party = FOREACH Upper_party GENERATE TRIM(*);

Upper_party: after converting to upper case, I want to see all the columns, not only the columns that were changed to upper case.

Trim_party: I did some research and found that, to trim all columns, I would have to write a UDF. I can write Trim_party = FOREACH Upper_party GENERATE TRIM(col1)...TRIM(coln); but I feel this is inefficient and time-consuming.

Is there any other way to make this script work without writing a UDF for TRIM?

Thanks in advance.


Solution

  • It would be easier if you gave a sample of your data. From what I understand, I would do it this way:

    -- Load each line as one string with TextLoader
    A = LOAD '/user/guest/Pig/20151112.PigTest.txt' USING TextLoader() AS (line:CHARARRAY);
    -- Apply UPPER to the whole line (this uppercases every field); TRIM is applied per column below, so spaces inside your strings are kept
    B = FOREACH A GENERATE UPPER(line) AS lineUP;
    -- Split lines with your delimiter
    C = FOREACH B GENERATE FLATTEN(STRSPLIT(lineUP, ',')) AS (col1:CHARARRAY, ... ,coln:CHARARRAY);
    -- Trim each column you need and keep it in the output
    D = FOREACH C GENERATE TRIM(col1) AS col1T, ..., TRIM(coln) AS colnT;
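
    If only some fields should be uppercased, a variation on the above is to nest UPPER around TRIM for just those columns and list every other column with TRIM alone, so all columns stay in the output. This is only a sketch: the three-column schema and the alias clean_party are assumptions for illustration, so replace them with your real column list.

    -- Assumption: the file has exactly three chararray columns (hypothetical schema)
    party = LOAD '/party_test_pig.txt' USING PigStorage(',')
            AS (col1:chararray, col2:chararray, col3:chararray);
    -- TRIM every column; UPPER only col1 and col2, col3 is trimmed but keeps its case
    clean_party = FOREACH party GENERATE
        UPPER(TRIM(col1)) AS col1,
        UPPER(TRIM(col2)) AS col2,
        TRIM(col3)        AS col3;
    -- Inspect the result
    DUMP clean_party;

    Listing each column is verbose, but everything still happens in a single FOREACH, and any column you list in GENERATE is kept whether or not it was uppercased.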