Hadoop returning fewer results than expected

I have two Python scripts, a mapper and a reducer (at this point the reducer basically just prints its input, nothing else). When I run them locally I get 4 results (strings), but on Hadoop I get only 3. Why does this happen?

I'm running this on Amazon Elastic MapReduce (Hadoop).

#!/usr/bin/env python

import sys
import re
import os
# Constants declaration


# regular expressions

pattern = re.compile("[a-z]*", re.IGNORECASE)

a_to_f_pattern = re.compile("[a-f]", re.IGNORECASE)
g_to_l_pattern = re.compile("[g-l]", re.IGNORECASE)
m_to_r_pattern = re.compile("[m-r]", re.IGNORECASE)
s_to_z_pattern = re.compile("[s-z]", re.IGNORECASE)

# variables initialization

converted_word = ""
next_word = ""
new_character = ""
filename = ""
prev_filename = ""
i = 0

# Read pairs as lines of input from STDIN
for line in sys.stdin:


    filename = os.environ['mapreduce_map_input_file']
    filename = filename.replace("s3://source123/input/","")

    # check if it's a new file, and reset the start position
    if filename != prev_filename:

        START_POSITION = 0
        next_word = ""
        converted_word = ""
        prev_filename = filename

    # loop through every word that matches the pattern
    for word in pattern.findall(line):

        new_character = convert(word)
        converted_word = converted_word + new_character

        if len(converted_word) > (WINDOW - OVERLAP):
            next_word = next_word + new_character

        # print "word= ", word
        # print "converted_word= ", converted_word

        END_POSITION = START_POSITION + (len(converted_word) - 1)

        print converted_word + "," + str(filename) + "," + str(START_POSITION) + "," + str(END_POSITION)

        new_character = convert(word)
        converted_word = next_word + new_character


  • The mapper task converts its input into lines and feeds those lines to the stdin of your process.

    In this case you have multiple input files, and you're assuming that the lines from the different files are fed in sequentially (i.e. file by file). They are more likely processed in parallel, so a mapper (which may get only a couple of the input files) could be resetting its counters more often than a sequential, file-by-file run would lead you to expect.
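As a rough illustration of that effect, here is a small sketch. It is not your actual mapper: `run_mapper`, the window size of 3, and the sample lines are all made up for the example, and it is written in Python 3 syntax. It assumes a mapper that carries state across lines and only emits a record once a full window has accumulated. Feeding all lines through one process, as in a local test, yields more records than running each file through its own mapper process, where the leftover state at the end of each file is lost.

```python
def run_mapper(lines, window=3):
    """Hypothetical stateful mapper: emit one record per full window of words.

    State (the buffer) lives only as long as this call, which stands in
    for one mapper process.
    """
    buffer = []
    records = []
    for line in lines:
        for word in line.split():
            buffer.append(word)
            if len(buffer) == window:
                records.append(" ".join(buffer))
                buffer = []
    # Words still in the buffer when the input ends are never emitted.
    return records


# Two made-up input files: 5 words and 7 words.
file_a = ["a b c d", "e"]
file_b = ["f g h i", "j k l"]

# Local run: one process sees every line, file after file.
local = run_mapper(file_a + file_b)

# Cluster-like run: each file goes to its own mapper process,
# so each one starts with empty state.
per_split = run_mapper(file_a) + run_mapper(file_b)

print(len(local), len(per_split))
```

With these sample inputs the single-process run emits 4 records and the per-file run emits 3, because the partial windows at the end of each file are dropped. Whether this is exactly what happens in your job depends on how your inputs are split, but it shows why state that spans lines can give different counts locally and on the cluster.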