
Why are Reduce input records different from Reduce output records?


I'm trying to use MapReduce in Python with the dumbo library. Below is my test code; I expected every record emitted by the mapper to show up in the reducer's output.

import dumbo

count = 0  # global counter used by the reducer

def mapper(key, value):
    fields = value.split("\t")
    myword = fields[0] + "\t" + fields[1]  # key: the first two tab-separated fields
    yield myword, value

def reducer(key, values):
    for value in values:
        mypid = value
        words = value.split("\t")
    global count
    count = count + 1
    myword = str(count) + "--" + words[1]  # count total lines in the reducer's output records
    yield myword, 1

if __name__ == "__main__":
    dumbo.run(mapper, reducer)

Below is the Map-Reduce Framework counter section from the job log. I expected "Reduce input records" to equal "Reduce output records", but it does not. What's wrong with my test code, or am I misunderstanding something about MapReduce? Thanks.

    Map-Reduce Framework
            Map input records=405057
            Map output records=405057
            Map output bytes=107178919
            Map output materialized bytes=108467155
            Input split bytes=2496
            Combine input records=0
            Combine output records=0
            Reduce input groups=63096
            Reduce shuffle bytes=108467155
            Reduce input records=405057
            Reduce output records=63096
            Spilled Records=810114

It works when I modify the reducer as below:

def reducer(key, values):
    global count
    for value in values:
        mypid = value
        words = value.split("\t")

        count = count + 1
        myword = str(count) + "--" + words[1]  # count total lines in the reducer's output records
        yield myword, 1
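
The difference between the two reducers can also be seen without submitting a Hadoop job by driving them by hand. The sketch below is purely illustrative: simulate and the sample lines are made up, it only fakes the shuffle with itertools.groupby, and it is not part of dumbo.

from itertools import groupby
from operator import itemgetter

def simulate(mapper, reducer, lines):
    # Run the mapper over every input line and collect its (key, value) pairs.
    pairs = [kv for line in lines for kv in mapper(None, line)]
    # Sort and group by key, which is roughly what the Hadoop shuffle does.
    pairs.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return pairs, output

lines = ["a\tb\tx", "a\tb\ty", "c\td\tz"]   # two lines share the key "a\tb"
pairs, output = simulate(mapper, reducer, lines)
print(len(pairs))    # 3 reduce input records
print(len(output))   # 2 with the original reducer, 3 with the modified one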

Solution

  • I expected "Reduce input records" to equal "Reduce output records", but it does not.

    I'm not sure why you expect this. The whole point of the reducer is that it receives a whole group of values at once (grouped by the key the mapper emits), and your reducer emits only one record per group (the single yield myword, 1 after the loop). So "Reduce input records" could only equal "Reduce output records" if every group contained exactly one record, i.e. if the first two fields of every value were unique in your record set. Your counters show they are not: the 405057 map output records fall into only 63096 distinct keys (Reduce input groups=63096), so the reducer is invoked 63096 times and emits 63096 records (Reduce output records=63096). Since the keys repeat, your reducer emits fewer records than it receives.
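
    As a tiny, made-up illustration of that: three map output records sharing only two distinct keys give two reduce output records when there is a single yield per group.

        map_output = [
            ("a\tb", "a\tb\tx"),
            ("a\tb", "a\tb\ty"),   # same key as the record above -> same group
            ("c\td", "c\td\tz"),
        ]
        # Reduce input records = 3, Reduce input groups = 2,
        # and with one yield per group, Reduce output records = 2.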

    (This is, in fact, the usual pattern; it's the reason the "reducer" is called that. The name comes from 'reduce' in functional languages, which reduces a collection of values to a single value.)
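
    (For comparison, outside Hadoop entirely, Python's own functools.reduce expresses the same idea: a whole sequence of input values is folded down to one output value.)

        from functools import reduce

        values = [1, 2, 3, 4]
        total = reduce(lambda acc, v: acc + v, values)
        print(total)   # 10: four input values, one output value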