
Issue with right shift in PIG


I have a CSV file that contains data in the following format:

data_id,data_text,data_author
1,"here some text...",anurag
2,"Hi, i am apsc...",apsc
3,"i am living in "NYC"",another user

I have tried the following approaches to load the data correctly. Approach 1:

temp = LOAD 'filepath' USING PigStorage(',');

When I dump temp, the fields are shifted to the right because of the extra comma inside the quoted text of the 2nd record.

Approach 2: loading the data using a newline as the delimiter

temp = LOAD 'filepath' USING PigStorage('\n');

This gives me 1 record in 1 bag.

Then I tried to use a regex to break up the bags:

mydata = FOREACH data GENERATE FLATTEN(REGEX_EXTRACT_ALL('\\s*,\\s*,\\s*'));

It throws this error:

Pig Stack Trace

ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT_ALL as multiple or none of them fit. Please use an explicit cast.

org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1059: Problem while reconciling output schema of ForEach
    at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.throwTypeCheckerException(TypeCheckingRelVisitor.java:142)
    at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:182)
    at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:76)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1635)
    at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1572)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1544)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:538)
    at org.apache.pig.Main.main(Main.java:157)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:622)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT_ALL as multiple or none of them fit. Please use an explicit cast.
    at org.apache.pig.newplan.logical.visitor.TypeCheckingExpVisitor.visit(TypeCheckingExpVisitor.java:775)
    at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:88)
    at org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:70)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visitExpressionPlan(TypeCheckingRelVisitor.java:191)
    at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:157)
    at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:246)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:174)

... 19 more

Please help.


Solution

  • You can use CSVExcelStorage to load your data. You will have to download piggybank.jar and register it in your Pig script.

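    -- Register the piggybank jar so Pig can find CSVExcelStorage.
    REGISTER /path_to_jar/piggybank.jar;
    DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();
    
    -- Commas inside double-quoted fields are kept as part of the field.
    A = LOAD 'filepath/file.txt' USING CSVExcelStorage(',') AS (f1:int,f2:chararray,f3:chararray);
    DUMP A;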
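
  • If you would rather stay with the regex approach: ERROR 1045 is most likely raised because REGEX_EXTRACT_ALL expects two arguments, the input string and the pattern, and the call in the question passes only the pattern. Below is a minimal sketch, assuming every data row has the shape id,"text",author as in the sample. The relation names (lines, parsed, mydata) and the pattern are illustrative assumptions, not something tested against your full file.

    ```pig
    -- Read each line as one chararray field (TextLoader is a Pig builtin).
    lines = LOAD 'filepath' USING TextLoader() AS (line:chararray);

    -- Pass the input field first, then the pattern with one capture group per column.
    -- The greedy (.*) keeps commas and inner quotes inside the quoted text field.
    parsed = FOREACH lines GENERATE
        FLATTEN(REGEX_EXTRACT_ALL(line, '^(\\d+),"(.*)",([^,]*)$'))
        AS (data_id:chararray, data_text:chararray, data_author:chararray);

    -- Lines that do not match (for example the header row) yield nulls; drop them.
    mydata = FILTER parsed BY data_id IS NOT NULL;
    DUMP mydata;
    ```

    This only works while the author column itself contains no commas or quotes; for general CSV input, CSVExcelStorage is the more robust choice.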