I am trying to use MapReduce in Python with the dumbo library. Below is my test code; I expect every record passed from the mapper to the reducer to appear in the reducer output.
import dumbo

count = 0  # module-level counter shared by all reducer calls in this process

def mapper(key, value):
    fields = value.split("\t")
    myword = fields[0] + "\t" + fields[1]
    yield myword, value

def reducer(key, values):
    for value in values:
        mypid = value
        words = value.split("\t")
        global count
        count = count + 1
    myword = str(count) + "--" + words[1]  # to count total lines in the reducer's output records
    yield myword, 1

if __name__ == "__main__":
    dumbo.run(mapper, reducer)
Below is the Map-Reduce Framework section of the job log. I expected "Reduce input records" to equal "Reduce output records", but it does not. Is something wrong with my test code, or have I misunderstood something about MapReduce? Thanks.
Map-Reduce Framework
Map input records=405057
Map output records=405057
Map output bytes=107178919
Map output materialized bytes=108467155
Input split bytes=2496
Combine input records=0
Combine output records=0
Reduce input groups=63096
Reduce shuffle bytes=108467155
Reduce input records=405057
Reduce output records=63096
Spilled Records=810114
It works when I modify the reducer as below:
def reducer(key, values):
    global count
    for value in values:
        mypid = value
        words = value.split("\t")
        count = count + 1
        myword = str(count) + "--" + words[1]  # to count total lines in the reducer's output records
        yield myword, 1
I expected "Reduce input records" to equal "Reduce output records", but it does not.
I'm not sure why you expect this. The whole point of the reducer is that it receives a group of values at once (grouped by the key emitted by the mapper), and your reducer emits only one record for each group (yield myword, 1). Your log confirms this: the 63096 "Reduce output records" match the 63096 "Reduce input groups" exactly. So the only way your "Reduce input records" would equal your "Reduce output records" is if each group contained exactly one record; that is, if the combination of the first two fields of each value were unique in your record set. Since that's apparently not the case, your reducer emits fewer records than it receives.
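To make the grouping concrete, here is a minimal local sketch in plain Python. It does not use dumbo at all: run_local is a hypothetical stand-in for the framework's map, sort, group and reduce phases, and the five sample records are made up for illustration. It compares a reducer that yields once per group with one that yields once per value.

```python
from itertools import groupby

def mapper(key, value):
    # Same keying as in the question: the first two tab-separated fields.
    fields = value.split("\t")
    yield fields[0] + "\t" + fields[1], value

def reducer_per_group(key, values):
    # The yield sits after the loop, so it runs once per call (once per key).
    last = None
    for value in values:
        last = value
    yield key, last

def reducer_per_value(key, values):
    # The yield sits inside the loop, so it runs once per input record.
    for value in values:
        yield key, value

def run_local(records, mapper, reducer):
    """Hypothetical stand-in for the framework: map, sort by key, group, reduce."""
    mapped = [pair for i, rec in enumerate(records) for pair in mapper(i, rec)]
    mapped.sort(key=lambda kv: kv[0])
    output = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        output.extend(reducer(key, (value for _, value in group)))
    return output

# Made-up sample data: five records that fall into three distinct keys.
records = ["a\tx\t1", "a\tx\t2", "a\ty\t3", "b\tx\t4", "b\tx\t5"]
per_group = run_local(records, mapper, reducer_per_group)
per_value = run_local(records, mapper, reducer_per_value)
print(len(records), len(per_group), len(per_value))  # 5 3 5
```

With five input records in three groups, the per-group reducer produces three output records and the per-value reducer produces five, which mirrors the relationship between "Reduce input groups", "Reduce input records" and "Reduce output records" in the log.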
(This is, in fact, the usual pattern; it's the reason the "reducer" is called that. The name comes from 'reduce' in functional languages, which reduces a collection of values to a single value.)
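For comparison, this is what the functional 'reduce' looks like in Python's standard library. It is only an illustration of the origin of the name, with made-up sample values, and is not part of dumbo or Hadoop.

```python
from functools import reduce
import operator

values = [3, 1, 4, 1, 5]

# Fold the whole collection into a single value: ((((3 + 1) + 4) + 1) + 5)
total = reduce(operator.add, values)

# Counting is also a reduction: start at 0 and add 1 for every element.
how_many = reduce(lambda acc, _: acc + 1, values, 0)

print(total, how_many)  # 14 5
```

A MapReduce reducer applies the same idea to each key's group of values, although nothing forces it to emit exactly one record per group.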