I am using yelps MRJob library for achieving map-reduce functionality. I know that map reduce has an internal sort and shuffle algorithm which sorts the values on the basis of their keys. So if I have the following results after map phase
(1, 24) (4, 25) (3, 26)
I know the sort and shuffle phase will produce following output
(1, 24) (3, 26) (4, 25)
Which is as expected
But if I have two similar keys and different values why does the sort and shuffle phase sorts the data on the basis of first value that appears?
For example If I have the following list of values from mapper
(2, <25, 26>) (1, <24, 23>) (1, <23, 24>)
The expected output is
(1, <24, 23>) (1, <23, 24>) (2, <25, 26>)
But the output that I am getting is
(1, <23, 24>) (1, <24, 23>) (2, <25, 26>)
is this MRjob library specific? Is there anyway to stop this sorting on the basis of values??
CODE
from mrjob.job import MRJob
import math
class SortMR(MRJob):
def steps(self):
return [
self.mr(mapper=self.rangemr,
reducer=self.rangesort)]
def rangemr(self, key, line):
for a in line.split():
yield 1,a
def rangesort(self,numid,line):
for a in line:
yield(1, a)
if __name__ == '__main__':
SortMR.run()
The local MRjob just uses the operating system 'sort' on the mapper output.
The mapper writes out in the format:
key<-tab->value\n
Thus you end up with the keys sorted primarily by key, but secondarily by value.
As noted, this doesn't happen in the real hadoop version, just the 'local' simulation.