Tags: python, numpy, matrix, time, multiplication

CPU time for matrix-matrix multiplication


I am trying to decide whether several similar but independent problems should be dealt with simultaneously or sequentially (possibly in parallel on different computers). To decide, I need to compare the CPU times of the following operations:

  • time_1 is the time for computing X (with shape (n,p)) @ b (with shape (p,1)).

  • time_k is the time for computing X (with shape (n,p)) @ B (with shape (p,k)).

where X, b and B are random matrices. The difference between the two operations is the width of the second matrix.

Naively, we expect that time_k = k x time_1. With faster matrix multiplication algorithms (Strassen algorithm, Coppersmith–Winograd algorithm), time_k could be smaller than k x time_1, but the complexity of these algorithms is still much larger than what I observed in practice. My question is therefore: how can we explain the large difference in CPU times between these two computations?
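
To make the naive expectation concrete, it can be checked directly by timing one width-k product against k width-1 products. A minimal sketch (the sizes n, p, k here are arbitrary):

import time
import numpy as np

n, p, k = 100, 100, 50
X = np.random.randn(n, p)
B = np.random.randn(p, k)
reps = 1000

# One product of X with the full (p, k) matrix.
start = time.perf_counter()
for _ in range(reps):
    X @ B
time_k = (time.perf_counter() - start) / reps

# k separate products of X with the individual columns of B.
cols = [B[:, j:j + 1] for j in range(k)]
start = time.perf_counter()
for _ in range(reps):
    for b in cols:
        X @ b
k_times_time_1 = (time.perf_counter() - start) / reps

print('time_k =', time_k, '  k x time_1 =', k_times_time_1)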

[Figure: comparison of the CPU times as a function of the width k]

The code I used is the following:

import time
import numpy as np
import matplotlib.pyplot as plt

p     = 100
width = np.concatenate([np.arange(1, 20), np.arange(20, 100, 10), np.arange(100, 4000, 100)]).astype(int)

mean_time = []
for nk, kk in enumerate(width):
    timings = []
    nb_tests = 10000 if kk <= 300 else 100
    for ni, ii in enumerate(range(nb_tests)):
        print('\r[', nk, '/', len(width), ', ',  ni, '/', nb_tests, ']', end = '')
        x     = np.random.randn(p).reshape((1, -1))
        coef  = np.random.randn(p, kk)
        d     = np.zeros((1, kk))
        start = time.time()
        d[:]  = x @ coef
        end   = time.time()
        timings.append(end - start)

    mean_time.append(np.mean(timings))

mean_time = np.array(mean_time)


fig, ax = plt.subplots(figsize=(14, 8))
plt.plot(width, mean_time, label='mean(time_k)')
plt.plot(width, width * mean_time[0], label='k * mean(time_1)')
plt.legend()
plt.xlabel('k')
plt.ylabel('time (sec)')
plt.show()
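
Note that time.time has limited resolution for sub-millisecond calls, so a variant of a single timing step based on the standard-library timeit module (a sketch, using the same shapes as above) may give more stable numbers:

import timeit
import numpy as np

p, kk = 100, 10
x    = np.random.randn(1, p)
coef = np.random.randn(p, kk)

# Best of 5 repeats, each averaging 10000 calls of x @ coef.
best = min(timeit.repeat(lambda: x @ coef, repeat=5, number=10000)) / 10000
print(best)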

Solution

  • The detail of the reason is complex. When the machine runs X @ b, it also executes many other required instructions besides the arithmetic itself, for example loading data from RAM into the cache. In other words, the elapsed time contains two parts: the 'real computation instructions' in the CPU, represented by Cost_A, and the 'other required instructions', represented by Cost_B. My guess is that it is Cost_B that leads to time_k << k x time_1.

    When the shape of b is small (e.g. 1000 x 1), the 'other required instructions' account for a relatively large share of the elapsed time; when the shape of b is huge (e.g. 1000 x 10000), they are relatively negligible. The following group of experiments gives a less rigorous demonstration. We can see that when the shape of b increases from (1000 x 1) to (1000 x 10), the elapsed time increases very slowly.

    import numpy as np
    import time
    
    X = np.random.random((1000, 1000))
    
    def test_cost(X, b):
        # Print the mean elapsed time of X @ b over 100 calls.
        begin = time.time()
        for i in range(100):
            _ = X @ b
        end = time.time()
        print((end - begin) / 100.)
    
    # Widths of b: 1, 3, 5, 7, 9, 10, 30, 60, 100, 1000.
    for k in (1, 3, 5, 7, 9, 10, 30, 60, 100, 1000):
        test_cost(X, np.random.random((1000, k)))
    
    output:
    0.0003210139274597168
    0.00040063619613647463
    0.0002452659606933594
    0.00026523590087890625
    0.0002449488639831543
    0.00024344682693481446
    0.00040068864822387693
    0.000691361427307129
    0.0011700797080993653
    0.009680757522583008
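
    One informal way to read these numbers is to fit a line time(k) = c0 + c1 * k to them: if the intercept c0 dominates the per-column term c1 * k for small k, the fixed Cost_B part is what breaks the naive time_k = k x time_1 scaling. A rough least-squares sketch over the timings printed above (values rounded):

    import numpy as np
    
    # Widths of b and the mean timings printed above (rounded).
    ks = np.array([1, 3, 5, 7, 9, 10, 30, 60, 100, 1000])
    ts = np.array([3.21e-4, 4.01e-4, 2.45e-4, 2.65e-4, 2.45e-4,
                   2.43e-4, 4.01e-4, 6.91e-4, 1.17e-3, 9.68e-3])
    
    # Least-squares fit of time(k) = c1 * k + c0.
    c1, c0 = np.polyfit(ks, ts, 1)
    print('per-column cost c1 =', c1)
    print('fixed overhead  c0 =', c0)

    With these numbers, the intercept comes out more than an order of magnitude larger than the per-column slope, so for small k the fixed part dominates the total time.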
    

    To go further, I ran a set of experiments with perf on Linux (e.g. something like perf stat -e cache-misses,instructions python broken2.py); according to perf, the Cost_B part may be even bigger. I have 8 Python files; the first one is as follows.

    import numpy as np
    
    def broken2():
        # Element-wise square of a random (1, 1000) matrix.
        mtx = np.random.random((1, 1000))
        c = mtx ** 2
    
    broken2()
    

    I processed the perf output into Table A, as follows.

    [Table A: perf counter statistics (e.g. cache-misses) and elapsed time for each experiment]

    As a simple analysis, I divided the difference in each counter (e.g. cache-misses) between neighbouring experiments by the corresponding difference in elapsed time (seconds). This gives Table B below. From the table, we can see that as the shape of b increases, the linear relation between shape and elapsed time becomes more obvious. And perhaps the main reason that leads to time_k << k x time_1 is cache misses (loading data from RAM into the cache), since that counter is the first to stabilize.

    [Table B: counter deltas between neighbouring experiments divided by elapsed-time deltas]
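
    In code, the analysis behind Table B amounts to a ratio of first differences. A minimal sketch with dummy values (for illustration only; the real numbers are those of Table A):

    import numpy as np
    
    # Dummy per-experiment measurements, for illustration only;
    # the real numbers are those of Table A.
    cache_misses = np.array([1.0e4, 1.2e4, 1.5e4, 2.4e4])
    elapsed      = np.array([2.4e-4, 2.5e-4, 4.0e-4, 6.9e-4])
    
    # Table B entry: change in a counter between neighbouring experiments
    # divided by the change in elapsed time.
    table_b = np.diff(cache_misses) / np.diff(elapsed)
    print(table_b)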