I was going through OSTEP's concurrency-mapreduce project, which involves building a toy MapReduce program that runs on a single machine using multiple threads. Towards the end of the description, the following line is written:
Thread Management. This part is fairly straightforward. You should create num_mappers mapping threads, and assign a file to each Map() invocation in some manner you think is best (e.g., Round Robin, Shortest-File-First, etc.). Which way might lead to best performance? You should also create num_reducers reducer threads at some point, to work on the map'd output.
(emphasis mine)
So the idea is basically that we have a bunch of input files on which we want to run Map(), and we are restricted to num_mappers threads for mapping (kind of analogous to a restricted number of machines, I guess). So my idea is to simply make a thread pool with num_mappers threads and insert all files into the wait queue in some order (e.g., shortest file first, as the line suggests), as in the sketch below.
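For concreteness, here is a minimal sketch of what I mean by the thread-pool approach: num_mappers worker threads pull filenames from a shared list until it is exhausted. Map(), the files array, and run_mappers() are placeholder names under my own assumptions, not necessarily the project's actual interface:

```c
#include <pthread.h>
#include <stddef.h>

extern void Map(char *filename);   /* the mapper callback (assumed signature) */

char **files;                      /* input files, pre-sorted however we like */
int num_files;
static int next_file = 0;          /* index of next unclaimed file */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker repeatedly claims the next file and maps it. */
static void *mapper_worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int i = next_file < num_files ? next_file++ : -1;
        pthread_mutex_unlock(&lock);
        if (i < 0)
            break;                 /* queue drained: worker exits */
        Map(files[i]);
    }
    return NULL;
}

void run_mappers(int num_mappers) {
    pthread_t tids[num_mappers];
    for (int t = 0; t < num_mappers; t++)
        pthread_create(&tids[t], NULL, mapper_worker, NULL);
    for (int t = 0; t < num_mappers; t++)
        pthread_join(tids[t], NULL);
}
```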
However, the line also mentions Round-Robin as a possible technique that can be used here. To my understanding, some sort of Round-Robin scheduler here would essentially involve letting a Map() invocation on one file do some work, then interrupting it and letting another Map() invocation do its work, and so on, similar to a standard OS round-robin scheduler.
The issue is that I don't really see a way this would be feasible: we are restricted to num_mappers threads, so we can't make a thread for each input file and let the OS schedule them as it wishes, and the alternative of doing user-level scheduling would probably be very messy and not what was intended.
This makes me think that the "Round Robin" mentioned here isn't what I think it is. Could someone clarify what it would mean in the context of MapReduce mappers, or point to some existing implementations of MapReduce with Round Robin? (I couldn't find any.)
The way you distribute work in map-reduce is a form of scheduling.
I interpreted it as: when num_files > num_mappers, how do you assign files to Map()ers? (In the opposite case, when num_files <= num_mappers, it doesn't matter how you assign them.)
In this context, Round Robin would mean assigning file 0 to Map()er 0, file 1 to Map()er 1, and so on, then wrapping around: file num_mappers back to Map()er 0, file num_mappers+1 to Map()er 1, and so on.
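In code, that static round-robin assignment is just an index stride. A rough sketch, where Map() and the files array stand in for whatever your harness provides:

```c
#include <pthread.h>

extern void Map(char *filename);       /* the mapper callback (assumed) */
extern char **files;                   /* input files, indices 0..num_files-1 */
extern int num_files;

struct rr_arg { int id; int num_mappers; };

/* Mapper t processes files t, t + num_mappers, t + 2*num_mappers, ... */
static void *rr_mapper(void *p) {
    struct rr_arg *a = p;
    for (int i = a->id; i < num_files; i += a->num_mappers)
        Map(files[i]);
    return NULL;
}

void run_round_robin(int num_mappers) {
    pthread_t tids[num_mappers];
    struct rr_arg args[num_mappers];
    for (int t = 0; t < num_mappers; t++) {
        args[t] = (struct rr_arg){ .id = t, .num_mappers = num_mappers };
        pthread_create(&tids[t], NULL, rr_mapper, &args[t]);
    }
    for (int t = 0; t < num_mappers; t++)
        pthread_join(tids[t], NULL);
}
```

Note that no preemption is involved: each mapper simply runs its slice of files to completion.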
An alternative is to re-order the files by length and then distribute them the same way, which could yield better utilization since the load is roughly balanced.
However, when there are a few very large files and a greater number of smaller files, a shortest-first order would delay starting the large files, so reversing the sort (longest first) would yield a better outcome.
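For example, re-ordering the list largest-first before handing it out could look roughly like this sketch, which uses stat() and omits error handling:

```c
#include <stdlib.h>
#include <sys/stat.h>

/* Return the size of a file, or 0 if it cannot be stat'ed. */
static off_t file_size(const char *path) {
    struct stat st;
    return stat(path, &st) == 0 ? st.st_size : 0;
}

/* qsort comparator: largest files first, so the long jobs start early. */
static int by_size_desc(const void *a, const void *b) {
    off_t sa = file_size(*(char *const *)a);
    off_t sb = file_size(*(char *const *)b);
    return (sa < sb) - (sa > sb);
}

/* Sort the file list in place before distributing it to the mappers. */
void sort_longest_first(char **files, int num_files) {
    qsort(files, num_files, sizeof(char *), by_size_desc);
}
```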
Scheduling is fun.