I'm trying to write a Pig UDF for the sample input file below, and I've included the expected output as well. Please help me with a UDF template for this, or let me know if there is a way to do it without a UDF.
My sample input:
2014-01-23T08:12:09.259443
Device Type make year
-- ------------ --------- --------------------------------------- -------------
desktop commercial hp 2010
laptop commercial Asus 2013
mobile personal Sony 2014
2015-01-15T08:12:09.259443
Device Type make year
-- ------------ --------- --------------------------------------- -------------
desktop commercial hp 2015
laptop commercial Asus 2016
mobile personal Sony 2013
I basically need the output as the timestamp followed by the fields of each row, separated by a delimiter. The delimiter can be ',', '\t' or '|'; for this example I'm using ','.
Expected Output:
2014-01-23T08:12:09.259443, desktop, commercial, hp, 2010
2014-01-23T08:12:09.259443, laptop, commercial, Asus, 2013
2014-01-23T08:12:09.259443, mobile, personal, Sony, 2014
2015-01-15T08:12:09.259443, desktop, commercial, hp, 2015
2015-01-15T08:12:09.259443, laptop, commercial, Asus, 2016
2015-01-15T08:12:09.259443, mobile, personal, Sony, 2013
Note: I can't do any pre-processing, as there are several TB of files.
This is the logic:
// Assumes str is a StringTokenizer that yields one input line per token.
// Compile the pattern once, outside the loop.
Pattern pa = Pattern.compile("\\d+-\\d+-\\d+T\\d+:\\d+:\\d+(\\.\\d+)?");
String timestamp = "";
while (str.hasMoreTokens()) {
    String val = str.nextToken().trim();
    String result = "";  // reset per line so an earlier row is not printed again
    if (pa.matcher(val).matches()) {
        // Keep the whole value; substring(0, 19) would drop the fractional seconds
        timestamp = val;
    } else if (val.startsWith("Device") || val.startsWith("--")) {
        // Header line or dashed separator line: skip it
    } else if (!val.isEmpty()) {
        // Data row: prefix the current timestamp and separate the fields with ", "
        result = timestamp + ", " + val.replaceAll("\\s+", ", ");
    }
    if (!result.isEmpty()) {
        System.out.println(result);
    }
}
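To make this easier to try out locally, here is a minimal self-contained sketch of the same idea with the delimiter as a parameter. It is plain Java, not a Pig UDF yet; the class name TimestampFlattener and the method flatten are just placeholders I made up, and I'm assuming every data row has whitespace-separated fields with no spaces inside a field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TimestampFlattener {
    // A timestamp on a line by itself, with optional fractional seconds.
    private static final Pattern TIMESTAMP =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?");

    // Prefixes every data row with the most recent timestamp line seen before it.
    public static List<String> flatten(Iterable<String> lines, String delimiter) {
        List<String> out = new ArrayList<>();
        String timestamp = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (TIMESTAMP.matcher(line).matches()) {
                timestamp = line;  // remember it for the rows that follow
            } else if (line.startsWith("Device") || line.startsWith("--")) {
                // header line or dashed separator line: nothing to emit
            } else if (timestamp != null) {
                // data row: split on whitespace and re-join with the delimiter
                String fields = String.join(delimiter, line.split("\\s+"));
                out.add(timestamp + delimiter + fields);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "2014-01-23T08:12:09.259443",
            "Device Type make year",
            "-- ------------ ---------",
            "desktop commercial hp 2010",
            "laptop commercial Asus 2013");
        for (String row : flatten(sample, ", ")) {
            System.out.println(row);
        }
    }
}
```

Note that this keeps state between lines, so it only works if the rows reach it in file order and a block of rows is never separated from its timestamp line. I'm not sure how to guarantee that in Pig on files this large.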
Please let me know if there is a more efficient or better way to do it. Thanks!