Tags: java, hadoop, substring, apache-pig, user-defined-functions

How to use the substring operation on all the columns of a flat file in Pig Latin


I need to limit the length of every value in each column of a flat file to 10000 characters using Pig. I have used the substring operation on a few columns, but I cannot figure out a way to apply it to all the columns.

Note: I do not know the column count in advance.

Thanks in advance.


Solution

  • Load the data as a single field. Write a UDF and pass the field as a parameter. In your UDF, use a loop to go through all the columns by splitting the field on the delimiter, and limit each column to the desired length. Reconstruct the line and return it as a single field. The script and UDF below should get you on the right track.

    Compile the UDF into a jar and register the jar in your Pig script.

    PIG

    REGISTER \path\TrimCols.jar;
    
    DEFINE TrimCols com.company.myproject.TrimCols();
    
    A = LOAD '/path/file.txt' USING TextLoader() AS (line:chararray);
    B = FOREACH A GENERATE TrimCols(line);
    DUMP B;
    

    UDF

    package com.company.myproject; // Matches the class name used in the DEFINE statement.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class TrimCols extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {

            if (input != null && input.size() != 0 && input.get(0) != null)
            {
                // Read the first field of the tuple; input.toString() would return
                // the whole tuple wrapped in parentheses, not the line itself.
                String line = (String) input.get(0);
                // Use whatever delimiter your columns are separated by.
                // The -1 limit keeps trailing empty columns.
                String[] items = line.split(",", -1);
                try
                {
                    StringBuilder s = new StringBuilder();
                    for (int i = 0; i < items.length; i++)
                    {
                        if (items[i].length() > 10000)
                            s.append(items[i].substring(0, 10000));
                        else
                            s.append(items[i]);
                        if (i < items.length - 1)
                            s.append(","); // Add the delimiter again. You will need it to split the trimmed columns in your Pig script.
                    }
                    return s.toString();

                } catch (Exception e)
                {
                    // Fall back to the untrimmed line if anything goes wrong.
                    return line;
                }
            }
            else
                return "INPUT_NULL"; // Return whatever you want, so that you can handle this case in your Pig script.
        }
    }
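
The truncation loop can also be tried outside Pig before building the jar. The following is only a minimal sketch of the same idea in plain Java; the class name `TrimColsSketch`, the method `trimColumns`, and its `delimiter` and `maxLen` parameters are hypothetical names chosen for illustration and are not part of the Pig API. It assumes the delimiter is a literal string, so it is quoted before being handed to `split`, which otherwise treats its argument as a regular expression.

```java
import java.util.regex.Pattern;

public class TrimColsSketch {

    // Truncates every delimiter-separated column of a line to at most maxLen characters.
    // Hypothetical helper for local testing; not part of Pig.
    static String trimColumns(String line, String delimiter, int maxLen) {
        // Pattern.quote makes split treat the delimiter literally (e.g. "|").
        // The -1 limit keeps trailing empty columns.
        String[] items = line.split(Pattern.quote(delimiter), -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < items.length; i++) {
            String item = items[i];
            out.append(item.length() > maxLen ? item.substring(0, maxLen) : item);
            if (i < items.length - 1) {
                out.append(delimiter); // Put the delimiter back between columns.
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A limit of 3 is used here only to keep the example short.
        System.out.println(trimColumns("abcdef,gh,,ijklmn,", ",", 3)); // prints abc,gh,,ijk,
    }
}
```

Passing the delimiter and the limit as arguments keeps the column count out of the picture entirely, which is the point of the single-field approach: the loop handles however many columns each line happens to have.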