hadoop, elastic-map-reduce

When can we init resources for a Hadoop Mapper?


I have a small SQLite database (post code -> US city name) and a big S3 file of users. I would like to map every user to the city name associated with their post code.

I followed the famous WordCount.java example, but I'm not sure how MapReduce works internally:

  • Is my mapper created once per S3 input file?
  • Should I connect to the SQLite database when the mapper is created? Should I do so in the constructor of the mapper?

Solution

  • MapReduce is a framework for writing applications that process big data in parallel, on large clusters of commodity hardware, in a reliable and fault-tolerant manner. MapReduce executes on top of HDFS (Hadoop Distributed File System) in two phases: the map phase and the reduce phase.

    Answer to your first question, Is my mapper created once per s3 input file?

    One Mapper is created per input split, and by default one split is created per block. So a file spanning several blocks produces several mappers, and a file smaller than one block still gets its own mapper; the mapper count follows the splits, not the files.

    High level overview is something like

    input file -> InputFormat -> Splits -> RecordReader -> Mapper -> Partitioner -> Shuffle & Sort -> Reducer -> final output

    Example,

    1. Your input files: server1.log, server2.log, server3.log.
    2. The InputFormat creates a number of splits based on the block size (by default).
    3. A Mapper is allocated to each split.
    4. A RecordReader sits between the split and the Mapper, turning the split's bytes into key/value records (lines, by default).
    5. Then the Partitioner assigns each map output key to a reducer.
    6. After partitioning, the Shuffle & Sort phase starts.
    7. The Reducers run.
    8. The final output is written.
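The split arithmetic behind steps 2-3 can be sketched in plain Java. The file sizes and the 128 MB block size are illustrative assumptions, and this is a simplification: real Hadoop also allows the last split of a file to be up to 10% larger than the block size.

```java
public class SplitCount {
    // Default HDFS block size in Hadoop 2+; also the default split size.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // One split per full or partial block of each input file (ceiling division).
    static long splitsFor(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long[] logSizes = {
            300L * 1024 * 1024, // server1.log -> 3 splits
            100L * 1024 * 1024, // server2.log -> 1 split
            256L * 1024 * 1024  // server3.log -> 2 splits
        };
        long totalSplits = 0;
        for (long size : logSizes) {
            totalSplits += splitsFor(size);
        }
        // One mapper is allocated per split.
        System.out.println("Mappers launched: " + totalSplits); // 6
    }
}
```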

    Answer to your 2nd question: a Mapper has three standard lifecycle methods, called in this order: setup() once before the first record, map() once per record, and cleanup() once after the last record. Open your SQLite connection in setup(), not in the constructor; Hadoop instantiates the mapper via reflection, and the task Context is not available until setup() runs.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {

     @Override
     protected void setup(Context context)
       throws IOException, InterruptedException {
      // Called only once, at startup: open connections and load lookup data here.
      System.out.println("called only once, at startup");
     }

     @Override
     protected void map(Object key, Text value, Context context)
       throws IOException, InterruptedException {
      // Called once per input record: filter/transform your data here.
     }

     @Override
     protected void cleanup(Context context)
       throws IOException, InterruptedException {
      // Called only once, at the end: close connections here.
      System.out.println("called only once, at the end");
     }
    }