algorithm, parallel-processing, big-o, matrix-multiplication

Parallel and distributed algorithms for matrix multiplication


The problem came up when I looked at the Wikipedia page on the matrix multiplication algorithm.

It says:

This algorithm has a critical path length of Θ((log n)^2) steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible speedup of Θ(n^3/((log n)^2)) on any real computer.

(The quote is from section "Parallel and distributed algorithms/Shared-memory parallelism.")

Assuming there are infinitely many processors, all the multiplications should be done in O(1). Then we just add all the elements up, and this should take constant time as well. Therefore, the critical path should be O(1) instead of Θ((log n)^2).

I was wondering whether there is a difference between O and Θ, and where I am mistaken.


The problem has been solved, big thanks to @Chris Beck. The answer has two parts.

First, a basic mistake: I did not count the time for the summation. Summing N values takes O(log(N)) steps (think of adding the values in pairs, as in a binary tree).
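As a rough illustration of why the summation cannot be done in constant time, here is a minimal sketch of a pairwise (tree) reduction. The function name `tree_sum` is hypothetical, and the sketch assumes that every addition within one round could run on its own processor, so the number of rounds stands in for the parallel time.

```python
def tree_sum(values):
    """Sum values pairwise; return (total, number_of_rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # All additions within a round are independent of each other, so
        # with enough processors one round would cost one time step.
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # odd element carries over unchanged
        level = paired
        rounds += 1
    return (level[0] if level else 0), rounds


total, rounds = tree_sum([1] * 1024)
print(total, rounds)
```

Each round halves the number of partial sums, so the round count grows like log2(N) no matter how many processors are available.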

Second, as Chris points out, non-trivial problems take O(log(N)) time in the processor. Altogether, the critical path should be O(log(N)^2) instead of O(1).

As for the confusion between O and Θ, I found the answer on the Big O notation Wikipedia page:

Big O is the most commonly used asymptotic notation for comparing functions, although in many cases Big O may be replaced with Big Theta Θ for asymptotically tighter bounds.


I was wrong in my last conclusion. The O(log(N)^2) does not come from the summation and the processor; it comes from splitting the matrix. Thanks to @displayName for pointing this out. In addition, Chris' answer about non-trivial problems is still very useful for studying parallel systems. Thanks to everyone who answered below!


Solution

  • There are two aspects to this question; addressing both answers it completely.

    • Why can't we bring the run-time down to O(1) by throwing in a sufficient number of processors?
    • How is the critical path length for matrix multiplication equal to Θ((log n)^2)?

    Going after the questions one by one.


    Infinite number of processors

    The simple answer to this point lies in understanding two terms: Task Granularity and Task Dependency.

    • Task Granularity - how fine the task decomposition is. Even if you have infinite processors, the maximum decomposition of a problem is still finite.
    • Task Dependency - which steps can only be performed sequentially. For example, you cannot modify the input until you have read it, so modifying is always preceded by reading the input and cannot be done in parallel with it.

    So, for a process with four steps A, B, C, D, where D depends on C, C depends on B, and B depends on A, a single processor will work as fast as 2 processors, as 4 processors, or as infinitely many processors.
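The A, B, C, D chain can be checked with a small sketch that computes the critical path length of a dependency graph. The function `critical_path_length` and the dictionary layout are hypothetical choices for illustration, and the sketch assumes every task costs one time unit.

```python
from functools import lru_cache


def critical_path_length(deps):
    """deps maps each task to the list of tasks it depends on."""

    @lru_cache(maxsize=None)
    def finish_time(task):
        # A task can start only after all of its prerequisites have finished.
        return 1 + max((finish_time(d) for d in deps[task]), default=0)

    return max((finish_time(t) for t in deps), default=0)


chain = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
independent = {"A": [], "B": [], "C": [], "D": []}

print(critical_path_length(chain))        # fully sequential chain
print(critical_path_length(independent))  # no dependencies at all
```

The chain needs four time units however many processors are available, while four independent tasks could finish in one unit given four processors.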

    This explains the first bullet.


    Critical Path Length for Parallel Matrix Multiplication

    1. If you divide a square matrix of size n × n into four blocks of size (n/2) × (n/2) each, and keep dividing until you reach a single element (a matrix of size 1 × 1), the number of levels in this tree-like design is O(log n).
    2. Thus, for matrix multiplication in parallel, since we have to recursively divide not one but two matrices of size n down to their last element, it takes O((log n)^2) time.
    3. In fact, this bound is tight: it is not just O((log n)^2), but Θ((log n)^2).
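One common way to make the (log n)^2 bound precise is a recurrence for the critical path length T_∞(n). This is a sketch under assumptions not stated above: that the algorithm is the fork-join divide-and-conquer version from the Wikipedia section quoted in the question, that all eight half-size products at a level run in parallel, and that combining them is a matrix addition whose parallel loops have critical path Θ(log n).

```latex
\begin{align*}
T_\infty(n) &= T_\infty(n/2) + \Theta(\log n), \qquad T_\infty(1) = \Theta(1) \\
            &= \Theta\!\left(\sum_{i=0}^{\log_2 n - 1} \log_2\frac{n}{2^i}\right)
             = \Theta\!\left(\sum_{i=0}^{\log_2 n - 1} (\log_2 n - i)\right) \\
            &= \Theta\!\left(\frac{\log_2 n\,(\log_2 n + 1)}{2}\right)
             = \Theta(\log^2 n)
\end{align*}
```

Under these assumptions there are log n levels of recursion and each level contributes up to a log n term, which is where the square comes from.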

    The difference between Big O and Theta is that Big O only says a function will not grow faster than the stated bound, while Theta says the function is bounded both above and below by it. Effectively, the plot of the function is sandwiched between the same bounding function multiplied by two different constants, as depicted in the image below. In other words, the two grow at the same rate:

    [Image: graph illustrating Θ notation]

    Image taken from: http://xlinux.nist.gov/dads/Images/thetaGraph.gif

    So, I'd say that for your case, you can ignore the notation and you are not "gravely" mistaken between the two.
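For reference, the sandwich description of Theta corresponds to the standard textbook definitions, where f is the function being bounded and g is the bounding function (the symbols c, c_1, c_2 and n_0 are the usual names for the constants, not taken from the text above):

```latex
\begin{align*}
f(n) = O(g(n)) &\iff \exists\, c > 0,\ n_0 :\ 0 \le f(n) \le c\, g(n)
  \quad \text{for all } n \ge n_0 \\
f(n) = \Theta(g(n)) &\iff \exists\, c_1, c_2 > 0,\ n_0 :\ 0 \le c_1\, g(n) \le f(n) \le c_2\, g(n)
  \quad \text{for all } n \ge n_0
\end{align*}
```

Every Θ bound is therefore also an O bound, but not the other way around.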


    To conclude...

    I'd like to define another term called Speedup or Parallelism. It is defined as the ratio of the best sequential execution time (also called work) to the parallel execution time. The best sequential execution time, already given on the Wikipedia page you've linked to, is O(n^3). The parallel execution time is O((log n)^2).

    Hence, the speedup is O(n^3/(log n)^2).

    And even though the speedup looks so simple and straightforward, achieving it in actual cases is very difficult due to the communication costs that are inherent in moving data.
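To get a feel for the numbers, here is a minimal sketch that evaluates the work and critical path recurrences for powers of two. It assumes the recursive fork-join algorithm with eight half-size products per level and n^2 additions to combine them. The function names `work` and `span` are hypothetical, and constant factors are chosen arbitrarily, so only the growth rate is meaningful.

```python
def work(n):
    """Total operations: W(n) = 8 * W(n/2) + n^2, with W(1) = 1."""
    if n == 1:
        return 1
    return 8 * work(n // 2) + n * n


def span(n):
    """Critical path: S(n) = S(n/2) + log2(n), with S(1) = 1."""
    if n == 1:
        return 1
    # For a power of two, bit_length() - 1 is exactly log2(n).
    return span(n // 2) + (n.bit_length() - 1)


for n in (2, 16, 256, 1024):
    print(n, work(n), span(n), work(n) / span(n))
```

The last column is the idealized speedup. It ignores the communication costs mentioned above, so it should be read as an upper bound rather than a prediction.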