python, multithreading, mapreduce

Basic MapReduce with threads is slower than the sequential version


I am trying to write a word counter with MapReduce using threads, but this version is much slower than the sequential one. With a 300 MB text file the MapReduce version takes about 80 s, while the sequential version takes significantly less. I don't understand why: I have implemented all the stages of MapReduce (split, map, shuffle, reduce), but I can't figure out what makes it slower, even though I used about 6 threads for the test. My first thought was that thread creation might be expensive compared to the execution time, but at around 80 s that is clearly not the problem. Could you take a look at the code and see what it is? I'm fairly sure the code produces correct results; I just don't know what is causing the slowness. One last thing: with a text file larger than 300 MB the program fills all of my computer's RAM. Is there any way to optimize it?


Solution

  • First of all, several disclaimers:

    1. To know the exact reason why the application is slow you would need to profile it. In this answer I'm giving some common-sense reasoning.
    2. I'm assuming you are using CPython.

    When you parallelize an algorithm, several factors influence performance. Some of them work in favour of speed (I'll mark them with +) and some against (-). Let's look at them:

    1. you need to split the work first (-)
    2. the work of the parallel workers is done simultaneously (+)
    3. parallel workers may need to synchronize their work (-)
    4. the reduce step takes time (-)

    For your parallel algorithm to give any gain over the sequential one, the factors that speed things up must outweigh the factors that drag you down.

    Moreover, the gain from #2 has to be big enough to compensate for the extra coordination work (which means that for small inputs you will get no boost at all, because the coordination overhead outweighs the savings).
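    In rough notation (mine, not a formal model), the condition for beating the sequential version is:

    ```
    T_parallel ≈ T_split + T_map / n_workers + T_sync + T_reduce  <  T_sequential
    ```

    The time saved in the map phase has to pay for the split, synchronization and reduce overhead.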

    The main problems in your implementation now are items #2 and #3.

    First of all, the workers are not actually running in parallel. The portion of the task you parallelize is CPU-bound, and in CPython the global interpreter lock (GIL) prevents the threads of a single process from executing Python bytecode on more than one CPU at a time. So in this program the threads never run in parallel; they take turns on the same CPU.
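    You can see this effect with a minimal, self-contained timing sketch (not taken from your code): a CPU-bound function run twice sequentially and then on two threads. On CPython the threaded run is typically no faster, and often slightly slower, because the threads just take turns holding the GIL.

    ```python
    import threading
    import time

    def busy_work(n):
        # Pure CPU work: no I/O, so the thread holds the GIL almost constantly
        total = 0
        for i in range(n):
            total += i
        return total

    N = 10_000_000

    # Sequential: two runs back to back
    start = time.perf_counter()
    busy_work(N)
    busy_work(N)
    print(f"sequential: {time.perf_counter() - start:.2f}s")

    # Threaded: two threads, but only one can execute Python bytecode at a time
    start = time.perf_counter()
    threads = [threading.Thread(target=busy_work, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"threaded:   {time.perf_counter() - start:.2f}s")
    ```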

    Moreover, every modification the workers make to the shared dicts goes through locking/unlocking, and that is much slower than the sequential version, which needs no such synchronization.

    To speed up your algorithm you need to:

    1. use multiprocessing instead of multithreading, so the work can actually run on multiple CPUs;
    2. structure the algorithm so that the workers need no synchronization while they do their job (each worker should use its own dict to store intermediate results); see the sketch below.
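    Here is a minimal sketch of that structure, assuming a plain-text input at a placeholder path big.txt (an illustration of the idea, not your exact pipeline): the file is split into one chunk per worker, each worker maps its chunk into its own Counter, and the parent process performs the reduce by merging the partial counters.

    ```python
    import multiprocessing as mp
    from collections import Counter

    def map_chunk(chunk):
        # Map phase: each worker fills its OWN Counter, so no locks are needed
        return Counter(chunk.split())

    def split_text(text, n_chunks):
        # Split phase: cut the text into roughly equal pieces, moving each
        # boundary forward to the next whitespace so no word is cut in half
        size = max(1, len(text) // n_chunks)
        chunks, start = [], 0
        while start < len(text):
            end = min(start + size, len(text))
            while end < len(text) and not text[end].isspace():
                end += 1
            chunks.append(text[start:end])
            start = end
        return chunks

    if __name__ == "__main__":
        with open("big.txt", encoding="utf-8") as f:  # placeholder file name
            text = f.read()
        n_workers = mp.cpu_count()
        with mp.Pool(n_workers) as pool:
            partial_counts = pool.map(map_chunk, split_text(text, n_workers))
        # Reduce phase: merge the per-worker counters in the parent process
        total = sum(partial_counts, Counter())
        print(total.most_common(10))
    ```

    Because each worker returns its own Counter, there is no shared state to lock during the map phase. As for memory: this sketch still reads the whole file at once; for files approaching your RAM size you would instead read and dispatch chunks incrementally (e.g. a fixed number of bytes at a time) rather than loading everything up front.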