numpy performance tensorflow matrix-multiplication intel-mkl

Why is float64 tf.matmul on the CPU in TensorFlow 2 significantly slower than NumPy matmul, even in graph mode?

I'm comparing the single-thread performance of matrix-matrix products in TensorFlow 2 and NumPy, separately for single precision (float32) and double precision (float64). NumPy performs almost identically to the Intel MKL C++ implementation (my baseline for matrix multiplication) in both precisions (SGEMM and DGEMM). In TensorFlow, however, only the single-precision (float32) performance matches MKL; the double-precision (float64) performance is significantly slower. Why is TensorFlow slower with double-precision data?

Sample Scripts:

The following example reproduces my observation. Consider the matrix multiplication:

C = AB where A and B are of size 3000x3000

The TensorFlow 2 and NumPy scripts are given below:

TensorFlow 2 code

import tensorflow as tf
from tensorflow.python.framework import test_util


# Check whether this TensorFlow build has MKL enabled
print("MKL Enabled : ", test_util.IsMklEnabled())


#Set threads
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)

#Problem size
N = 3000
REPS = 20
DTYPE = tf.float64
#DTYPE = tf.float32


@tf.function
def gemm_implicit_noup(A, B):
    #C = A @ B
    start = tf.timestamp()
    with tf.control_dependencies([start]):
        C = tf.matmul(A,B)
    with tf.control_dependencies([C]):
        end = tf.timestamp()
    tf.print(end-start)
    return C

tf.config.run_functions_eagerly(False)

A = tf.random.normal([N, N], dtype=DTYPE)
B = tf.random.normal([N, N], dtype=DTYPE)


# The first call traces the function (builds the graph), so its timing includes tracing overhead
C = gemm_implicit_noup(A,B)

for i in range(REPS):
    C = gemm_implicit_noup(A,B)

NumPy code

import os
os.environ["OMP_NUM_THREADS"] = "1"
import numpy as np
import time

N = 3000
REPS = 20
DTYPE = np.float64
#DTYPE = np.float32

def gemm_implicit_noup(A, B):
    #C = A @ B
    C = np.matmul(A,B)
    return C



A = np.random.randn(N,N).astype(DTYPE)
B = np.random.randn(N,N).astype(DTYPE)

for i in range(REPS):
    start = time.perf_counter()
    C = gemm_implicit_noup(A,B)
    end = time.perf_counter()
    print(end-start)

System and Installation settings:

I compared performance on an Intel Xeon Skylake 2.1 GHz machine running CentOS 7 and on a 2018 MacBook Pro running macOS Big Sur, with both TensorFlow 2.7 and 2.8 (built with Intel MKL) and with Python 3.9.7 and 3.7.4. I compare single-thread performance so that the results can be reliably reproduced. I observe similar performance numbers in all of these settings:

Single precision performance is as expected:

Intel MKL C++ SGEMM ~ 0.5s
NumPy float32 ~ 0.5s
TensorFlow float32 ~ 0.5s

But double-precision performance is not:

Intel MKL C++ DGEMM ~ 0.9s
NumPy float64 ~ 1s
TensorFlow float64 > 2.5s (much slower)
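
To put these timings on a common scale, they can be converted to approximate throughput. The sketch below assumes the conventional count of 2N^3 floating-point operations for a dense NxN by NxN product; the helper name `gflops` is hypothetical and only for illustration.

```python
def gflops(n, seconds):
    """Approximate throughput of an n x n by n x n matrix product in GFLOP/s.

    Assumes the conventional 2 * n**3 floating-point operation count.
    """
    return 2 * n**3 / seconds / 1e9


N = 3000

# Approximate timings (seconds) as reported above.
# The TensorFlow float64 figure is a lower bound on the time ("> 2.5s"),
# so the throughput computed from it is an upper bound.
timings = {
    "Intel MKL C++ SGEMM": 0.5,
    "NumPy float32": 0.5,
    "TensorFlow float32": 0.5,
    "Intel MKL C++ DGEMM": 0.9,
    "NumPy float64": 1.0,
    "TensorFlow float64": 2.5,
}

for name, seconds in timings.items():
    print(f"{name:22s} {gflops(N, seconds):6.1f} GFLOP/s")
```

Under that assumption, the float64 gap corresponds to roughly 54 to 60 GFLOP/s for NumPy and MKL versus at most about 22 GFLOP/s for TensorFlow.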

Solution

Assuming your processor supports Intel® AVX-512 instructions, try installing the Intel® Optimization for TensorFlow wheel built specifically for AVX-512 via pip. These packages are available as *.whl files on the Intel® website for specific Python versions, or can be installed with the following command for Python 3.7, 3.8, and 3.9 (Linux only).

pip install intel-tensorflow-avx512==2.7.0

This is documented on the official Intel® website, in the sections linked below:

Intel® Optimization for TensorFlow: Installation Guide

Intel® Optimization for TensorFlow: Install the Intel® Optimization for TensorFlow Wheel via PIP
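
Before installing the AVX-512 wheel, it may be worth confirming that the CPU actually reports AVX-512 support. The sketch below is an illustration under stated assumptions, not part of Intel's documentation: it assumes a Linux system where /proc/cpuinfo lists CPU features on a `flags` line, and it checks for `avx512f` (the AVX-512 Foundation flag). The helper names `cpu_flags` and `has_avx512f` are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # All cores normally report the same flags, so the first match is enough.
            return set(value.split())
    return set()


def has_avx512f(cpuinfo_text):
    """True if the AVX-512 Foundation flag appears among the CPU flags."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Not Linux, or /proc is unavailable: this check cannot say anything.
        text = ""
        print("Could not read /proc/cpuinfo; this check only applies to Linux.")
    print("AVX-512F reported:", has_avx512f(text))
```

If the flag is not reported, the AVX-512 build is unlikely to be the right choice for that machine.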

AVX-512 is a Single Instruction, Multiple Data (SIMD) instruction-set extension with 512-bit vector registers, so a single instruction can operate on eight double-precision (or sixteen single-precision) values at once. To take full advantage of Intel® architecture and extract maximum performance, the TensorFlow framework has been optimized using oneAPI Deep Neural Network Library (oneDNN) primitives, a popular performance library for deep learning applications. As an additional optimization step, also try setting the environment variable TF_ENABLE_ONEDNN_OPTS to 1 in your Linux terminal with the following command before running the TensorFlow code:

export TF_ENABLE_ONEDNN_OPTS=1
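
If exporting the variable in the shell is inconvenient (for example in a notebook), the same setting can be applied from Python. This is a minimal sketch under the assumption that TensorFlow reads TF_ENABLE_ONEDNN_OPTS when it is loaded, so the variable has to be set before `import tensorflow`; the helper name `enable_onednn_opts` is hypothetical.

```python
import os
import sys


def enable_onednn_opts():
    """Set TF_ENABLE_ONEDNN_OPTS=1 for the current process.

    Returns False if TensorFlow was already imported, in which case the
    setting may come too late to have any effect (an assumption, see above).
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"
    return not already_imported


if enable_onednn_opts():
    print("TF_ENABLE_ONEDNN_OPTS set before TensorFlow import.")
else:
    print("TensorFlow was already imported; restart the process and set the variable first.")

# Import TensorFlow only after the variable is set, e.g.:
# import tensorflow as tf
```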

The single-thread timings I obtained for double-precision matrix-matrix products using the code you provided are given below. The test was run on an Intel® Xeon® Platinum 8260M CPU @ 2.40GHz with Python 3.8 and TensorFlow 2.7 built with Intel® MKL and AVX-512 optimizations.

NumPy float64 ~ 1.44s
TensorFlow float64 (MKL Enabled) ~ 2.77s
TensorFlow float64 (MKL Enabled, AVX512 Optimized, oneDNN Optimization Enabled) ~ 1.19s