Tags: amazon-web-services, hadoop, apache-pig, amazon-emr, parquet

Not able to load data into Pig on EMR from an S3 bucket (Parquet file)


I want to load data from an S3 bucket into Pig on EMR. My source files are in Parquet format.

This is the command I used:

A = LOAD 's3://test-1/icted/emp_db/emp_tb' 
USING parquet.pig.ParquetLoader(header__change_seq:chararray,header__change_oper:chararray,header__change_mask:chararray,header__stream_position:chararray,header__operation:chararray,header__transaction_id:chararray,header__timestamp:chararray,policylangaccessind_afi:chararray,loadcommandid:double,previousgroupid:double,enddate:chararray,assignedbyuserid:double,dstcd_afi:chararray);

I am not able to load the data. I get the error below:

      ERROR pig.PigServer: exception during parsing: Error during parsing. <file test.pig, line 20, column 2>  mismatched input 'header__change_seq' expecting RIGHT_PAREN

Need help on this.


Solution

  • A couple of things:

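    On EMR, use the fully qualified class name, org.apache.parquet.pig.ParquetLoader(). There is no need to pass it a schema; the Parquet reader will infer it from the file. The parse error itself comes from the unquoted schema: Pig accepts only quoted string literals as arguments to a load function's constructor.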

    Make sure the version of the Parquet Pig library you are using is compatible with the version of Parquet that wrote the file (parquet-tools can be used to find the version of Parquet used).

    Try using a recent version of the bundle, for example 1.10.0: https://mvnrepository.com/artifact/org.apache.parquet/parquet-pig-bundle/1.10.0

    REGISTER parquet-pig-bundle-1.10.0.jar;
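
Putting these points together, the corrected script might look like the sketch below. It assumes the bundle jar sits in the directory the script is run from, and it reuses the S3 path from the question. The alias B is only an illustration of column projection: it assumes the loader's single-string constructor, which takes the requested schema as one quoted string, and the two columns chosen are arbitrary.

```pig
-- Register the Parquet Pig bundle (jar location is an assumption; adjust the path).
REGISTER parquet-pig-bundle-1.10.0.jar;

-- No schema argument: the loader reads the schema from the Parquet file metadata.
A = LOAD 's3://test-1/icted/emp_db/emp_tb'
    USING org.apache.parquet.pig.ParquetLoader();

-- Check the inferred schema before building on it.
DESCRIBE A;

-- Optional: request a subset of columns. The schema must be ONE quoted string,
-- not bare identifiers as in the original command.
B = LOAD 's3://test-1/icted/emp_db/emp_tb'
    USING org.apache.parquet.pig.ParquetLoader('header__change_seq:chararray, loadcommandid:double');
```

If DESCRIBE A prints the expected columns, the loader and the file are compatible and the rest of the script can work from A directly.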