Tags: mongodb, hadoop, apache-pig

Mappers fail when using Pig to insert data into MongoDB


I am trying to import a file from HDFS into MongoDB using MongoInsertStorage with Pig. The file is large, around 5 GB. The script runs fine when I run it in local mode with

 pig -x local example.pig

However, if I run it in MapReduce mode, most of the mappers fail with the following error:

 Error: com.mongodb.ConnectionString.getReadConcern()Lcom/mongodb/ReadConcern; 
 Container killed by the ApplicationMaster. 
 Container killed on request. 
 Exit code is 143 Container exited with a non-zero exit code 143

Can someone help me solve this issue? I also increased the memory allocated to the YARN containers, but that hasn't helped.
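
For reference, these are the kinds of properties I was raising (the values here are only illustrative):

 SET mapreduce.map.memory.mb 4096;          -- YARN container size per mapper, in MB
 SET mapreduce.map.java.opts '-Xmx3277m';   -- mapper JVM heap, kept below the container size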

Some mappers are also timing out after 300 seconds.

The Pig script is as follows:

REGISTER mongo-java-driver-3.2.2.jar;
REGISTER mongo-hadoop-core-1.4.0.jar;
REGISTER mongo-hadoop-pig-1.4.0.jar;
REGISTER mongodb-driver-3.2.2.jar;

DEFINE MongoInsertStorage com.mongodb.hadoop.pig.MongoInsertStorage();

SET mapreduce.reduce.speculative true;

-- Load the CSV from HDFS, then write each record into MongoDB.
BIG_DATA = LOAD 'hdfs://example.com:8020/user/someuser/sample.csv'
    USING PigStorage(',')
    AS (a:chararray, b:chararray, c:chararray);

STORE BIG_DATA INTO 'mongodb://insert.some.ip.here:27017/test.samplecollection'
    USING MongoInsertStorage('', '');
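
As an aside on the script above: it turns speculative execution on for reducers, but when tasks write straight into an external store like MongoDB, duplicate speculative task attempts can insert the same documents twice, so disabling speculation is the commonly recommended setting (a sketch):

 SET mapreduce.map.speculative false;
 SET mapreduce.reduce.speculative false;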

Solution

  • Found a solution.

    For the error:

    Error: com.mongodb.ConnectionString.getReadConcern()Lcom/mongodb/ReadConcern; 
     Container killed by the ApplicationMaster. 
     Container killed on request. 
     Exit code is 143 Container exited with a non-zero exit code 143
    

    I changed the JAR versions: mongo-hadoop-core and mongo-hadoop-pig from 1.4.0 to 2.0.2, and the MongoDB Java driver from 3.2.2 to 3.4.2. This eliminated the ReadConcern error on the mappers! (That message is the signature form of a java.lang.NoSuchMethodError: the com.mongodb.ConnectionString class actually loaded at runtime, likely an older driver copy somewhere on the task classpath, had no getReadConcern() method.)
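
    With the new versions, the REGISTER block becomes something like this (a sketch: exact jar file names depend on your downloads, and the extra mongodb-driver jar is dropped here since mongo-java-driver is the all-in-one jar):

    REGISTER mongo-java-driver-3.4.2.jar;
    REGISTER mongo-hadoop-core-2.0.2.jar;
    REGISTER mongo-hadoop-pig-2.0.2.jar;

    For the timeout, I added this after registering the jars: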

    SET mapreduce.task.timeout 1800000;  -- 30 minutes, in milliseconds
    

    I had been using SET mapred.task.timeout (the deprecated pre-YARN name for the same property), which didn't work.

    Hope this helps anyone who has a similar issue!