Tags: hadoop, apache-pig, delimiter, hadoop2, udf

How do I write a Pig UDF that parses tab-separated data and prepends a timestamp to each row?


I'm trying to write a Pig UDF for the sample input file below; the expected output is shown as well. Please help me with a UDF template for this, or let me know if there is a way to do it without a UDF.

My sample input:

2014-01-23T08:12:09.259443
   Device        Type         make                                    year
-- ------------  ---------  ---------------------------------------  -------------
   desktop       commercial   hp                                      2010
   laptop        commercial   Asus                                    2013
   mobile        personal     Sony                                    2014


2015-01-15T08:12:09.259443
   Device        Type         make                                    year
-- ------------  ---------  ---------------------------------------  -------------
   desktop       commercial   hp                                      2015
   laptop        commercial   Asus                                    2016
   mobile        personal     Sony                                    2013   

I basically need each output row to be the timestamp followed by the fields, separated by a delimiter. The delimiter can be ',', '\t', or '|'; in this example I'm using ','.

Expected Output:

   2014-01-23T08:12:09.259443, desktop, commercial, hp, 2010
   2014-01-23T08:12:09.259443, laptop, commercial, Asus, 2013
   2014-01-23T08:12:09.259443, mobile, personal, Sony, 2014
   2015-01-15T08:12:09.259443, desktop, commercial, hp, 2015
   2015-01-15T08:12:09.259443, laptop, commercial, Asus, 2016
   2015-01-15T08:12:09.259443, mobile, personal, Sony, 2013

Note: I can't pre-process the data, as there are several TB of files.


Solution

  • This is the logic:

    // Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
    // Assumes `str` is a StringTokenizer that yields one input line per token,
    // e.g. new StringTokenizer(text, "\n").
    // Compile the pattern once, outside the loop.
    Pattern pa = Pattern.compile("\\d+-\\d+-\\d+T\\d+:\\d+:\\d+(\\.\\d+)?");
    String timestamp = "";
    while (str.hasMoreTokens()) {
        String val = str.nextToken().trim();
        Matcher ma = pa.matcher(val);
        if (ma.matches()) {
            // Timestamp line: keep the whole value, fractional seconds included,
            // so it matches the expected output.
            timestamp = val;
        } else if (val.isEmpty() || val.startsWith("Device") || val.startsWith("--")) {
            // Skip blank lines, the column-header line and the dashed separator line.
        } else {
            // Data row: split on runs of whitespace and join with the delimiter.
            // Assumes no field value contains whitespace.
            String result = timestamp + ", " + String.join(", ", val.split("\\s+"));
            System.out.println(result);
        }
    }
    

    Please let me know if there is a more efficient or better way to do it. Thanks!
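
For reference, here is a minimal, self-contained sketch of the same logic as a plain Java class, with the delimiter as a parameter. The names `TimestampFlattener` and `flatten` are made up for illustration and are not part of Pig. It assumes every block starts with a timestamp line, that header lines start with "Device" or "--", and that no field value contains whitespace. Lines are consumed one at a time and rows are handed to a callback, so nothing beyond the current timestamp is kept in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Pig API): flattens timestamp-headed report blocks. */
class TimestampFlattener {
    // Same timestamp shape as the regex in the snippet above.
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d+-\\d+-\\d+T\\d+:\\d+:\\d+(\\.\\d+)?");

    /**
     * Reads lines in order and passes one delimited row per data line to sink,
     * each prefixed with the timestamp of the block it belongs to.
     */
    static void flatten(Iterable<String> lines, String delimiter, Consumer<String> sink) {
        String timestamp = null;
        for (String line : lines) {
            String val = line.trim();
            if (TIMESTAMP.matcher(val).matches()) {
                timestamp = val; // a new block starts here
            } else if (val.isEmpty() || val.startsWith("Device") || val.startsWith("--")) {
                continue; // blank line, column header, or dashed separator
            } else if (timestamp != null) {
                // Assumption: fields contain no whitespace, so splitting on
                // runs of whitespace recovers the columns.
                sink.accept(timestamp + delimiter + String.join(delimiter, val.split("\\s+")));
            }
            // Data lines seen before any timestamp are dropped.
        }
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "2014-01-23T08:12:09.259443",
                "   Device        Type         make      year",
                "-- ------------  ---------  --------  -------------",
                "   desktop       commercial   hp        2010",
                "   laptop        commercial   Asus      2013");
        List<String> rows = new ArrayList<>();
        flatten(sample, ", ", rows::add);
        rows.forEach(System.out::println);
    }
}
```

Passing "\t" or "|" as the second argument switches the delimiter. Note that the timestamp has to carry over from earlier lines, so whatever wraps this logic inside Pig must see the lines of a block in order; the sketch does not address how input splits are handled.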