Tags: c++, torch, blas, intel-mkl, libtorch

How to see details behind CPU-only Libtorch Matrix-Matrix multiplication routines?


I have downloaded the libtorch CPU-only version from the website and unzipped it.

Inside my .cpp application, which uses libtorch, I write the following (I am using intel-mkl for other parts of the application, and I would like libtorch to use it as well):

    omp_set_num_threads(64);
    mkl_set_num_threads(64);

I then check:

    std::cout << "torch::get_num_threads() returns: " << torch::get_num_threads() << std::endl;

    std::cout << "omp_get_max_threads() returns: " << omp_get_max_threads() << std::endl;
    std::cout << "mkl_get_max_threads() returns: " << mkl_get_max_threads() << std::endl;

These all return 64.

(Yes, I really do have that many cores: I am on an HPC machine with 128 cores per node, and I am launching 2 MPI processes per node.)

I then perform std::complex<double> matrix-matrix multiplications via torch::matmul() function calls.

These multiplications seem slow to me.

How can I check that:

  1. Libtorch uses MKL behind the scenes?
  2. Libtorch uses threads for its matrix-matrix multiplications? Does my check above guarantee that Libtorch uses more than one thread behind the scenes?

Thank you!

EDIT:

$ nm -a --demangle libtorch_cpu.so | grep 'zgemv'
000000000b2c5290 T mkl_blas_avx2_xzgemv
000000000b2356e0 T mkl_blas_avx512_xzgemv
000000000b34a430 T mkl_blas_avx_xzgemv
000000000da8d320 T mkl_blas_avx_zgemv_c
000000000da8c310 T mkl_blas_avx_zgemv_n
000000000da8b840 T mkl_blas_avx_zgemv_t
000000000b75bae0 T mkl_blas_cnr_def_xzgemv
000000000e706570 T mkl_blas_cnr_def_zgemv_c
000000000e7072f0 T mkl_blas_cnr_def_zgemv_c_any
000000000e704cc0 T mkl_blas_cnr_def_zgemv_n
000000000e7058f0 T mkl_blas_cnr_def_zgemv_n_any
000000000e703180 T mkl_blas_cnr_def_zgemv_t
000000000e703f00 T mkl_blas_cnr_def_zgemv_t_any
000000000b6a3120 T mkl_blas_def_xzgemv
000000000e5e8370 T mkl_blas_def_zgemv_c
000000000e5e90f0 T mkl_blas_def_zgemv_c_any
000000000e5e6ac0 T mkl_blas_def_zgemv_n
000000000e5e76f0 T mkl_blas_def_zgemv_n_any
000000000e5e4f80 T mkl_blas_def_zgemv_t
000000000e5e5d00 T mkl_blas_def_zgemv_t_any
000000000b3f17d0 T mkl_blas_mc3_xzgemv
000000000de1d390 T mkl_blas_mc3_zgemv_c
000000000de1c380 T mkl_blas_mc3_zgemv_n
000000000de1b8b0 T mkl_blas_mc3_zgemv_t
000000000b4d9a50 T mkl_blas_mc_xzgemv
000000000e1ca5f0 T mkl_blas_mc_zgemv_c
000000000e1c7490 T mkl_blas_mc_zgemv_n
000000000e1c4660 T mkl_blas_mc_zgemv_t
0000000007fc0940 T mkl_blas_xzgemv
0000000007dfd930 T mkl_blas_zgemv
0000000007ed8920 T mkl_blas_zgemv_omp
0000000007ed8030 t mkl_blas_zgemv_omp._omp_fn.0

$ nm -a --demangle libtorch_cpu.so | grep 'zgemm'
00000000188df120 B .gomp_critical_user_mkl_blas_zgemm_omp_acopy_la_cs
0000000007ce5920 T cblas_zgemm_batch
0000000007cf6970 T mkl_blas__zgemm
0000000007cf76a0 T mkl_blas__zgemm_batch
000000000b2752b0 T mkl_blas_avx2_xzgemm
000000000d89f650 T mkl_blas_avx2_xzgemm_acopiedbcopy
000000000b2769f0 T mkl_blas_avx2_xzgemm_bdz
000000000b273320 T mkl_blas_avx2_xzgemm_internal_team
000000000b276a00 T mkl_blas_avx2_xzgemm_par
000000000b29c9b0 T mkl_blas_avx2_xzgemmger
000000000b272d70 T mkl_blas_avx2_xzgemmt
000000000b277610 T mkl_blas_avx2_zgemm_api_support
000000000b2769c0 T mkl_blas_avx2_zgemm_blk_info_bdz
000000000d899c70 T mkl_blas_avx2_zgemm_copyac
000000000b2920d0 T mkl_blas_avx2_zgemm_copyac_htn
000000000b276b50 T mkl_blas_avx2_zgemm_copyan
000000000b2920b0 T mkl_blas_avx2_zgemm_copyan_htn
000000000b276b10 T mkl_blas_avx2_zgemm_copyat
000000000b2920c0 T mkl_blas_avx2_zgemm_copyat_htn
000000000d899c30 T mkl_blas_avx2_zgemm_copybc
000000000b2920a0 T mkl_blas_avx2_zgemm_copybc_htn
000000000b276ad0 T mkl_blas_avx2_zgemm_copybn
000000000b292080 T mkl_blas_avx2_zgemm_copybn_htn
000000000b276a90 T mkl_blas_avx2_zgemm_copybt
000000000b292090 T mkl_blas_avx2_zgemm_copybt_htn
000000000b2770a0 T mkl_blas_avx2_zgemm_free_bufs
000000000b292170 T mkl_blas_avx2_zgemm_freebufs
000000000b276a70 T mkl_blas_avx2_zgemm_get_blks_size
000000000b276bd0 T mkl_blas_avx2_zgemm_get_bufs
000000000b276e70 T mkl_blas_avx2_zgemm_get_bufs_pack
000000000b276a80 T mkl_blas_avx2_zgemm_get_bufs_size
000000000b276a20 T mkl_blas_avx2_zgemm_get_kernel
000000000b276a40 T mkl_blas_avx2_zgemm_get_kernel_version
000000000b276a30 T mkl_blas_avx2_zgemm_get_optimal_kernel
000000000b276da0 T mkl_blas_avx2_zgemm_get_size_bufs
000000000b292160 T mkl_blas_avx2_zgemm_getbufs
000000000b2769d0 T mkl_blas_avx2_zgemm_getbufs_bdz
000000000b2770c0 T mkl_blas_avx2_zgemm_initialize_buffers
000000000b276150 T mkl_blas_avx2_zgemm_initialize_kernel_info
000000000b2760a0 T mkl_blas_avx2_zgemm_initialize_strategy
000000000d8b00c0 T mkl_blas_avx2_zgemm_ker0
000000000d8b0040 T mkl_blas_avx2_zgemm_ker0_cnr
000000000f697e00 T mkl_blas_avx2_zgemm_kernel_0
000000000f69a800 T mkl_blas_avx2_zgemm_kernel_0_b0
000000000f69bc00 T mkl_blas_avx2_zgemm_kernel_0_b0_cnr
000000000b2769e0 T mkl_blas_avx2_zgemm_kernel_0_bdz
000000000f699300 T mkl_blas_avx2_zgemm_kernel_0_cnr
000000000b2760e0 T mkl_blas_avx2_zgemm_map_thread_to_kernel
000000000b29f6c0 T mkl_blas_avx2_zgemm_mscale
000000000d8b0020 T mkl_blas_avx2_zgemm_mscale_wrapper
000000000b276a50 T mkl_blas_avx2_zgemm_num_kernels
000000000b295810 T mkl_blas_avx2_zgemm_pst
000000000b276a60 T mkl_blas_avx2_zgemm_set_blks_size
000000000b276f70 T mkl_blas_avx2_zgemm_set_bufs_pack
000000000b2bfb40 T mkl_blas_avx2_zgemm_sm_01
000000000b2bf8d0 T mkl_blas_avx2_zgemm_sm_01_10
000000000d9bc870 T mkl_blas_avx2_zgemm_sm_02
000000000d9b3250 T mkl_blas_avx2_zgemm_sm_03
000000000d9a7be0 T mkl_blas_avx2_zgemm_sm_04
000000000d99a650 T mkl_blas_avx2_zgemm_sm_05
000000000d98bac0 T mkl_blas_avx2_zgemm_sm_06
000000000d97a6d0 T mkl_blas_avx2_zgemm_sm_07
000000000d966d00 T mkl_blas_avx2_zgemm_sm_08
000000000d951770 T mkl_blas_avx2_zgemm_sm_09
000000000d939e70 T mkl_blas_avx2_zgemm_sm_10
000000000f697200 T mkl_blas_avx2_zgemm_zccopy_down2_ea
000000000f695900 T mkl_blas_avx2_zgemm_zccopy_right6_ea
000000000d67d600 T mkl_blas_avx2_zgemm_zcopy_down2_ea
000000000d67b200 T mkl_blas_avx2_zgemm_zcopy_down6_ea
000000000d67aa00 T mkl_blas_avx2_zgemm_zcopy_right2_ea
000000000d679100 T mkl_blas_avx2_zgemm_zcopy_right6_ea
000000000b276a10 T mkl_blas_avx2_zgemm_zero_desc
000000000d92fef0 T mkl_blas_avx2_zgemmt_nobufs
000000000b1d9fd0 T mkl_blas_avx512_xzgemm
000000000d478600 T mkl_blas_avx512_xzgemm_acopiedbcopy
000000000b1db700 T mkl_blas_avx512_xzgemm_bdz
000000000b1d8150 T mkl_blas_avx512_xzgemm_internal_team
000000000b1db710 T mkl_blas_avx512_xzgemm_par
000000000b20ce30 T mkl_blas_avx512_xzgemmger
000000000b1d7ba0 T mkl_blas_avx512_xzgemmt
000000000b1dc320 T mkl_blas_avx512_zgemm_api_support
000000000b1db6d0 T mkl_blas_avx512_zgemm_blk_info_bdz
000000000d473110 T mkl_blas_avx512_zgemm_copyac
000000000b1f42d0 T mkl_blas_avx512_zgemm_copyac_htn
000000000b1db860 T mkl_blas_avx512_zgemm_copyan
000000000b1f42b0 T mkl_blas_avx512_zgemm_copyan_htn
000000000b1db820 T mkl_blas_avx512_zgemm_copyat
000000000b1f42c0 T mkl_blas_avx512_zgemm_copyat_htn
000000000d4730d0 T mkl_blas_avx512_zgemm_copybc
000000000b1f42a0 T mkl_blas_avx512_zgemm_copybc_htn
000000000b1db7e0 T mkl_blas_avx512_zgemm_copybn
000000000b1f4280 T mkl_blas_avx512_zgemm_copybn_htn
000000000b1db7a0 T mkl_blas_avx512_zgemm_copybt
000000000b1f4290 T mkl_blas_avx512_zgemm_copybt_htn
000000000b1dbdb0 T mkl_blas_avx512_zgemm_free_bufs
000000000b1f4370 T mkl_blas_avx512_zgemm_freebufs
000000000b1db780 T mkl_blas_avx512_zgemm_get_blks_size
000000000b1db8e0 T mkl_blas_avx512_zgemm_get_bufs
000000000b1dbb80 T mkl_blas_avx512_zgemm_get_bufs_pack
000000000b1db790 T mkl_blas_avx512_zgemm_get_bufs_size
000000000b1db730 T mkl_blas_avx512_zgemm_get_kernel
000000000b1db750 T mkl_blas_avx512_zgemm_get_kernel_version
000000000b1db740 T mkl_blas_avx512_zgemm_get_optimal_kernel
000000000b1dbab0 T mkl_blas_avx512_zgemm_get_size_bufs
000000000b1f4360 T mkl_blas_avx512_zgemm_getbufs
000000000b1db6e0 T mkl_blas_avx512_zgemm_getbufs_bdz
000000000b1dbdd0 T mkl_blas_avx512_zgemm_initialize_buffers
000000000b1dae70 T mkl_blas_avx512_zgemm_initialize_kernel_info
000000000b1dadc0 T mkl_blas_avx512_zgemm_initialize_strategy
000000000d481b70 T mkl_blas_avx512_zgemm_ker0
000000000d481af0 T mkl_blas_avx512_zgemm_ker0_cnr
000000000f37be00 T mkl_blas_avx512_zgemm_kernel_0
000000000f374e00 T mkl_blas_avx512_zgemm_kernel_0_b0
000000000f36de00 T mkl_blas_avx512_zgemm_kernel_0_b0_cnr
000000000b1db6f0 T mkl_blas_avx512_zgemm_kernel_0_bdz
000000000f366c00 T mkl_blas_avx512_zgemm_kernel_0_cnr
000000000b1dae00 T mkl_blas_avx512_zgemm_map_thread_to_kernel
000000000b20f7f0 T mkl_blas_avx512_zgemm_mscale
000000000d481ad0 T mkl_blas_avx512_zgemm_mscale_wrapper
000000000b1db760 T mkl_blas_avx512_zgemm_num_kernels
000000000b205840 T mkl_blas_avx512_zgemm_pst
000000000b1db770 T mkl_blas_avx512_zgemm_set_blks_size
000000000b1dbc80 T mkl_blas_avx512_zgemm_set_bufs_pack
000000000b22fac0 T mkl_blas_avx512_zgemm_sm_01
000000000b22f850 T mkl_blas_avx512_zgemm_sm_01_10
000000000d5b6a90 T mkl_blas_avx512_zgemm_sm_02
000000000d5ad2d0 T mkl_blas_avx512_zgemm_sm_03
000000000d5a18d0 T mkl_blas_avx512_zgemm_sm_04
000000000d5940d0 T mkl_blas_avx512_zgemm_sm_05
000000000d583fb0 T mkl_blas_avx512_zgemm_sm_06
000000000d571570 T mkl_blas_avx512_zgemm_sm_07
000000000d55c150 T mkl_blas_avx512_zgemm_sm_08
000000000d5446a0 T mkl_blas_avx512_zgemm_sm_09
000000000d529d40 T mkl_blas_avx512_zgemm_sm_10
000000000f365900 T mkl_blas_avx512_zgemm_zccopy_down4_ea
000000000f363000 T mkl_blas_avx512_zgemm_zccopy_right12_ea
000000000cf87200 T mkl_blas_avx512_zgemm_zcopy_down12_ea
000000000cf85f00 T mkl_blas_avx512_zgemm_zcopy_down4_ea
000000000cf83600 T mkl_blas_avx512_zgemm_zcopy_right12_ea
000000000cf82800 T mkl_blas_avx512_zgemm_zcopy_right4_ea
000000000b1db720 T mkl_blas_avx512_zgemm_zero_desc
000000000d4d7cf0 T mkl_blas_avx512_zgemmt_nobufs
000000000b301bf0 T mkl_blas_avx_xzgemm
000000000b2fb630 T mkl_blas_avx_xzgemm_bdz
000000000b2fb5a0 T mkl_blas_avx_xzgemm_internal
000000000b2fb5b0 T mkl_blas_avx_xzgemm_internal_team
000000000b2ff170 T mkl_blas_avx_xzgemm_par
000000000b3159e0 T mkl_blas_avx_xzgemmger
000000000b2fda20 T mkl_blas_avx_xzgemmt
000000000b2fe690 T mkl_blas_avx_zgemm_api_support
000000000b2fb600 T mkl_blas_avx_zgemm_blk_info_bdz
000000000dac39e0 T mkl_blas_avx_zgemm_copyac
000000000b3118a0 T mkl_blas_avx_zgemm_copyac_htn
000000000dac35f0 T mkl_blas_avx_zgemm_copyan
000000000b311880 T mkl_blas_avx_zgemm_copyan_htn
000000000dac2a00 T mkl_blas_avx_zgemm_copyat
000000000b311890 T mkl_blas_avx_zgemm_copyat_htn
000000000dac2630 T mkl_blas_avx_zgemm_copybc
000000000b311870 T mkl_blas_avx_zgemm_copybc_htn
000000000dac11f0 T mkl_blas_avx_zgemm_copybn
000000000b311850 T mkl_blas_avx_zgemm_copybn_htn
000000000dac0e90 T mkl_blas_avx_zgemm_copybt
000000000b311860 T mkl_blas_avx_zgemm_copybt_htn
000000000dac0e80 T mkl_blas_avx_zgemm_free_bufs
000000000b311940 T mkl_blas_avx_zgemm_freebufs
000000000dac0d10 T mkl_blas_avx_zgemm_get_blks_size
000000000f7fb410 T mkl_blas_avx_zgemm_get_bufs
000000000b2fb2a0 T mkl_blas_avx_zgemm_get_bufs_size
000000000dac0d00 T mkl_blas_avx_zgemm_get_kernel_version
000000000b2fe580 T mkl_blas_avx_zgemm_get_optimal_kernel
000000000b311930 T mkl_blas_avx_zgemm_getbufs
000000000b2fb610 T mkl_blas_avx_zgemm_getbufs_bdz
000000000b2fb650 T mkl_blas_avx_zgemm_initialize_buffers
000000000b2fb640 T mkl_blas_avx_zgemm_initialize_kernel_info
000000000dac0b30 T mkl_blas_avx_zgemm_ker0
000000000f809c30 T mkl_blas_avx_zgemm_ker0_pst
000000000f7eb770 T mkl_blas_avx_zgemm_kernel_0
000000000f7ea100 T mkl_blas_avx_zgemm_kernel_0_b0
000000000b2fb620 T mkl_blas_avx_zgemm_kernel_0_bdz
000000000b2fb590 T mkl_blas_avx_zgemm_map_thread_to_kernel
000000000b319450 T mkl_blas_avx_zgemm_mscale
000000000b342750 T mkl_blas_avx_zgemm_pst
000000000b2fe570 T mkl_blas_avx_zgemm_set_blks_size
000000000b33b520 T mkl_blas_avx_zgemm_sm_01
000000000b33b2b0 T mkl_blas_avx_zgemm_sm_01_10
000000000dcbbd90 T mkl_blas_avx_zgemm_sm_02
000000000dcb08e0 T mkl_blas_avx_zgemm_sm_03
000000000dca3910 T mkl_blas_avx_zgemm_sm_04
000000000dc94cc0 T mkl_blas_avx_zgemm_sm_05
000000000dc83790 T mkl_blas_avx_zgemm_sm_06
000000000dc6f640 T mkl_blas_avx_zgemm_sm_07
000000000dc58a60 T mkl_blas_avx_zgemm_sm_08
000000000dc401c0 T mkl_blas_avx_zgemm_sm_09
000000000dc25100 T mkl_blas_avx_zgemm_sm_10
000000000b2fe4b0 T mkl_blas_avx_zgemm_zero_desc
000000000b32fc10 T mkl_blas_avx_zgemmt_nobufs
000000000b71fce0 T mkl_blas_cnr_def_xzgemm
000000000b71f700 T mkl_blas_cnr_def_xzgemm_bdz
000000000e738030 T mkl_blas_cnr_def_xzgemm_brc
000000000b71f5e0 T mkl_blas_cnr_def_xzgemm_internal
000000000b71f5f0 T mkl_blas_cnr_def_xzgemm_internal_team
000000000b72ee00 T mkl_blas_cnr_def_xzgemm_par
000000000b737b70 T mkl_blas_cnr_def_xzgemmger
000000000b71eb30 T mkl_blas_cnr_def_xzgemmt
000000000b71f6f0 T mkl_blas_cnr_def_zgemm_api_support
000000000b71f620 T mkl_blas_cnr_def_zgemm_blk_info_bdz
000000000b756c10 T mkl_blas_cnr_def_zgemm_copyac
000000000e737f30 T mkl_blas_cnr_def_zgemm_copyac_bdz
000000000f9ba270 T mkl_blas_cnr_def_zgemm_copyac_brc
000000000b7569d0 T mkl_blas_cnr_def_zgemm_copyan
000000000e737f10 T mkl_blas_cnr_def_zgemm_copyan_bdz
000000000f9b9e80 T mkl_blas_cnr_def_zgemm_copyan_brc
000000000b7566c0 T mkl_blas_cnr_def_zgemm_copyat
000000000e737ef0 T mkl_blas_cnr_def_zgemm_copyat_bdz
000000000f9b9a80 T mkl_blas_cnr_def_zgemm_copyat_brc
000000000b756550 T mkl_blas_cnr_def_zgemm_copybc
000000000e737eb0 T mkl_blas_cnr_def_zgemm_copybc_bdz
000000000f9b9700 T mkl_blas_cnr_def_zgemm_copybc_brc
000000000b7563b0 T mkl_blas_cnr_def_zgemm_copybn
000000000e737e70 T mkl_blas_cnr_def_zgemm_copybn_bdz
000000000f9b9320 T mkl_blas_cnr_def_zgemm_copybn_brc
000000000b7562a0 T mkl_blas_cnr_def_zgemm_copybt
000000000e737e30 T mkl_blas_cnr_def_zgemm_copybt_bdz
000000000f9b8fe0 T mkl_blas_cnr_def_zgemm_copybt_brc
000000000b72edf0 T mkl_blas_cnr_def_zgemm_free_bufs
000000000e738020 T mkl_blas_cnr_def_zgemm_freebufs_bdz
000000000b72edc0 T mkl_blas_cnr_def_zgemm_get_blks_size
000000000f9b8f30 T mkl_blas_cnr_def_zgemm_get_blks_size_brc
000000000b72ee30 T mkl_blas_cnr_def_zgemm_get_bufs
000000000f9ba5b0 T mkl_blas_cnr_def_zgemm_get_bufs_brc
000000000b72ee10 T mkl_blas_cnr_def_zgemm_get_bufs_size
000000000b72ee20 T mkl_blas_cnr_def_zgemm_get_kernel
000000000b72edb0 T mkl_blas_cnr_def_zgemm_get_optimal_kernel
000000000e737f50 T mkl_blas_cnr_def_zgemm_getbufs_bdz
000000000b71f610 T mkl_blas_cnr_def_zgemm_initialize_buffers
000000000b71f600 T mkl_blas_cnr_def_zgemm_initialize_kernel_info
000000000e6c3c90 T mkl_blas_cnr_def_zgemm_inner
000000000e6bedf0 T mkl_blas_cnr_def_zgemm_inner_b_roll
000000000e6ba110 T mkl_blas_cnr_def_zgemm_inner_roll
000000000e6b54f0 T mkl_blas_cnr_def_zgemm_inner_z_roll
000000000e6a0a00 T mkl_blas_cnr_def_zgemm_kernel_0_bdz
000000000f996800 T mkl_blas_cnr_def_zgemm_kernel_0_brc
000000000e69fc20 T mkl_blas_cnr_def_zgemm_kernel_0_zen
000000000b71f5d0 T mkl_blas_cnr_def_zgemm_map_thread_to_kernel
000000000b755d10 T mkl_blas_cnr_def_zgemm_mscale
000000000b72ede0 T mkl_blas_cnr_def_zgemm_num_kernels
000000000b7502a0 T mkl_blas_cnr_def_zgemm_pst
000000000b74fe90 T mkl_blas_cnr_def_zgemm_scalm
000000000b72edd0 T mkl_blas_cnr_def_zgemm_set_blks_size
000000000f995c00 T mkl_blas_cnr_def_zgemm_zccopy_down2_bdz
000000000f994d00 T mkl_blas_cnr_def_zgemm_zccopy_right4_bdz
000000000f994100 T mkl_blas_cnr_def_zgemm_zcopy_down2_bdz
000000000f992b00 T mkl_blas_cnr_def_zgemm_zcopy_down4_bdz
000000000f992300 T mkl_blas_cnr_def_zgemm_zcopy_right2_bdz
000000000f991400 T mkl_blas_cnr_def_zgemm_zcopy_right4_bdz
000000000b72eda0 T mkl_blas_cnr_def_zgemm_zero_desc
000000000b74fd60 T mkl_blas_cnr_def_zgemm_zerom
000000000b74a420 T mkl_blas_cnr_def_zgemmt_nobufs
000000000b6634f0 T mkl_blas_def_xzgemm
000000000b662f10 T mkl_blas_def_xzgemm_bdz
000000000e61a000 T mkl_blas_def_xzgemm_brc
000000000b6620d0 T mkl_blas_def_xzgemm_internal
000000000b6620e0 T mkl_blas_def_xzgemm_internal_team
000000000b672610 T mkl_blas_def_xzgemm_par
000000000b67b290 T mkl_blas_def_xzgemmger
000000000b662390 T mkl_blas_def_xzgemmt
000000000b662f00 T mkl_blas_def_zgemm_api_support
000000000b662e30 T mkl_blas_def_zgemm_blk_info_bdz
000000000b69e010 T mkl_blas_def_zgemm_copyac
000000000e619f00 T mkl_blas_def_zgemm_copyac_bdz
000000000f96da70 T mkl_blas_def_zgemm_copyac_brc
000000000b69ddd0 T mkl_blas_def_zgemm_copyan
000000000e619ee0 T mkl_blas_def_zgemm_copyan_bdz
000000000f96d680 T mkl_blas_def_zgemm_copyan_brc
000000000b69d9d0 T mkl_blas_def_zgemm_copyat
000000000e619ec0 T mkl_blas_def_zgemm_copyat_bdz
000000000f96d280 T mkl_blas_def_zgemm_copyat_brc
000000000b69d860 T mkl_blas_def_zgemm_copybc
000000000e619e80 T mkl_blas_def_zgemm_copybc_bdz
000000000f96cf00 T mkl_blas_def_zgemm_copybc_brc
000000000b69d6c0 T mkl_blas_def_zgemm_copybn
000000000e619e40 T mkl_blas_def_zgemm_copybn_bdz
000000000f96cb20 T mkl_blas_def_zgemm_copybn_brc
000000000b69d5b0 T mkl_blas_def_zgemm_copybt
000000000e619e00 T mkl_blas_def_zgemm_copybt_bdz
000000000f96c7e0 T mkl_blas_def_zgemm_copybt_brc
000000000b672600 T mkl_blas_def_zgemm_free_bufs
000000000e619ff0 T mkl_blas_def_zgemm_freebufs_bdz
000000000b6725d0 T mkl_blas_def_zgemm_get_blks_size
000000000f96c730 T mkl_blas_def_zgemm_get_blks_size_brc
000000000b672640 T mkl_blas_def_zgemm_get_bufs
000000000f96ddb0 T mkl_blas_def_zgemm_get_bufs_brc
000000000b672620 T mkl_blas_def_zgemm_get_bufs_size
000000000b672630 T mkl_blas_def_zgemm_get_kernel
000000000b6725c0 T mkl_blas_def_zgemm_get_optimal_kernel
000000000e619f20 T mkl_blas_def_zgemm_getbufs_bdz
000000000b662100 T mkl_blas_def_zgemm_initialize_buffers
000000000b6620f0 T mkl_blas_def_zgemm_initialize_kernel_info
000000000e5a5a90 T mkl_blas_def_zgemm_inner
000000000e5a0bf0 T mkl_blas_def_zgemm_inner_b_roll
000000000e59bf10 T mkl_blas_def_zgemm_inner_roll
000000000e5972f0 T mkl_blas_def_zgemm_inner_z_roll
000000000e5828e0 T mkl_blas_def_zgemm_kernel_0_bdz
000000000f94a000 T mkl_blas_def_zgemm_kernel_0_brc
000000000e581b00 T mkl_blas_def_zgemm_kernel_0_zen
000000000b6620c0 T mkl_blas_def_zgemm_map_thread_to_kernel
000000000b67e0c0 T mkl_blas_def_zgemm_mscale
000000000b6725f0 T mkl_blas_def_zgemm_num_kernels
000000000b697b40 T mkl_blas_def_zgemm_pst
000000000b697730 T mkl_blas_def_zgemm_scalm
000000000b6725e0 T mkl_blas_def_zgemm_set_blks_size
000000000f949400 T mkl_blas_def_zgemm_zccopy_down2_bdz
000000000f948500 T mkl_blas_def_zgemm_zccopy_right4_bdz
000000000f947900 T mkl_blas_def_zgemm_zcopy_down2_bdz
000000000f946300 T mkl_blas_def_zgemm_zcopy_down4_bdz
000000000f945b00 T mkl_blas_def_zgemm_zcopy_right2_bdz
000000000f944c00 T mkl_blas_def_zgemm_zcopy_right4_bdz
000000000b6725b0 T mkl_blas_def_zgemm_zero_desc
000000000b697600 T mkl_blas_def_zgemm_zerom
000000000b691cc0 T mkl_blas_def_zgemmt_nobufs
0000000007d0b540 T mkl_blas_errchk_zgemm
0000000007d0b750 T mkl_blas_errchk_zgemm_batch
0000000007d08b60 T mkl_blas_errchk_zgemm_batch_ilp64
0000000007d08940 T mkl_blas_errchk_zgemm_ilp64
000000000b3be250 T mkl_blas_mc3_xzgemm
000000000b3c8600 T mkl_blas_mc3_xzgemm_abcopied_htn
000000000b3c85e0 T mkl_blas_mc3_xzgemm_acopied_htn
000000000b3c85f0 T mkl_blas_mc3_xzgemm_bcopied_htn
000000000b3b9750 T mkl_blas_mc3_xzgemm_bdz
000000000b3b96c0 T mkl_blas_mc3_xzgemm_internal
000000000b3b96d0 T mkl_blas_mc3_xzgemm_internal_team
000000000b3bb7d0 T mkl_blas_mc3_xzgemm_par
000000000b3cc290 T mkl_blas_mc3_xzgemmger
000000000b3ba3a0 T mkl_blas_mc3_xzgemmt
000000000b3bb000 T mkl_blas_mc3_zgemm_api_support
000000000b3b9720 T mkl_blas_mc3_zgemm_blk_info_bdz
000000000b3c85d0 T mkl_blas_mc3_zgemm_blk_info_htn
000000000dfd13b0 T mkl_blas_mc3_zgemm_copyac
000000000b3c85b0 T mkl_blas_mc3_zgemm_copyac_htn
000000000dfd0ed0 T mkl_blas_mc3_zgemm_copyan
000000000b3c8570 T mkl_blas_mc3_zgemm_copyan_htn
000000000dfd0cf0 T mkl_blas_mc3_zgemm_copyat
000000000b3c8580 T mkl_blas_mc3_zgemm_copyat_htn
000000000dfd0860 T mkl_blas_mc3_zgemm_copybc
000000000b3c85c0 T mkl_blas_mc3_zgemm_copybc_htn
000000000dfd0390 T mkl_blas_mc3_zgemm_copybn
000000000b3c8590 T mkl_blas_mc3_zgemm_copybn_htn
000000000dfcff60 T mkl_blas_mc3_zgemm_copybt
000000000b3c85a0 T mkl_blas_mc3_zgemm_copybt_htn
000000000de44fc0 T mkl_blas_mc3_zgemm_free_bufs
000000000f873d10 T mkl_blas_mc3_zgemm_get_blks_size
000000000f873940 T mkl_blas_mc3_zgemm_get_bufs
000000000b3b93c0 T mkl_blas_mc3_zgemm_get_bufs_size
000000000de44fb0 T mkl_blas_mc3_zgemm_get_kernel_version
000000000b3baef0 T mkl_blas_mc3_zgemm_get_optimal_kernel
000000000b3b9730 T mkl_blas_mc3_zgemm_getbufs_bdz
000000000b3b9770 T mkl_blas_mc3_zgemm_initialize_buffers
000000000b3b9760 T mkl_blas_mc3_zgemm_initialize_kernel_info
000000000f8737a0 T mkl_blas_mc3_zgemm_ker0
000000000f87e780 T mkl_blas_mc3_zgemm_ker0_pst
000000000fe26670 T mkl_blas_mc3_zgemm_kernel_0_0
000000000fe25130 T mkl_blas_mc3_zgemm_kernel_0_1
000000000b3b9740 T mkl_blas_mc3_zgemm_kernel_0_bdz
000000000b3b96b0 T mkl_blas_mc3_zgemm_map_thread_to_kernel
000000000b3cef20 T mkl_blas_mc3_zgemm_mscale
000000000b3eb880 T mkl_blas_mc3_zgemm_pst
000000000b3b93b0 T mkl_blas_mc3_zgemm_set_blks_size
000000000b3e8160 T mkl_blas_mc3_zgemm_sm_01
000000000b3e7ef0 T mkl_blas_mc3_zgemm_sm_01_10
000000000dfcc600 T mkl_blas_mc3_zgemm_sm_02
000000000dfc7bd0 T mkl_blas_mc3_zgemm_sm_03
000000000dfc1f40 T mkl_blas_mc3_zgemm_sm_04
000000000dfbae70 T mkl_blas_mc3_zgemm_sm_05
000000000dfb2d50 T mkl_blas_mc3_zgemm_sm_06
000000000dfa9780 T mkl_blas_mc3_zgemm_sm_07
000000000df9ed70 T mkl_blas_mc3_zgemm_sm_08
000000000df92e80 T mkl_blas_mc3_zgemm_sm_09
000000000df859b0 T mkl_blas_mc3_zgemm_sm_10
000000000b3bae30 T mkl_blas_mc3_zgemm_zero_desc
000000000b3df7d0 T mkl_blas_mc3_zgemmt_nobufs
000000000b4552b0 T mkl_blas_mc_xzgemm
000000000b44f620 T mkl_blas_mc_xzgemm_bdz
000000000b44f590 T mkl_blas_mc_xzgemm_internal
000000000b44f5a0 T mkl_blas_mc_xzgemm_internal_team
000000000b4527d0 T mkl_blas_mc_xzgemm_par
000000000b46e390 T mkl_blas_mc_xzgemmger
000000000b44f970 T mkl_blas_mc_xzgemmt
000000000b4508e0 T mkl_blas_mc_zgemm_api_support
000000000b44f5f0 T mkl_blas_mc_zgemm_blk_info_bdz
000000000e1ebba0 T mkl_blas_mc_zgemm_copya_ext
000000000e3a51e0 T mkl_blas_mc_zgemm_copyac
000000000e3a4da0 T mkl_blas_mc_zgemm_copyac_htn
000000000e3a4740 T mkl_blas_mc_zgemm_copyan
000000000e3a40e0 T mkl_blas_mc_zgemm_copyan_htn
000000000e3a3cf0 T mkl_blas_mc_zgemm_copyat
000000000e3a3900 T mkl_blas_mc_zgemm_copyat_htn
000000000e1ebb80 T mkl_blas_mc_zgemm_copyb_ext
000000000e3a33f0 T mkl_blas_mc_zgemm_copybc
000000000e3a3050 T mkl_blas_mc_zgemm_copybc_htn
000000000e3a2bc0 T mkl_blas_mc_zgemm_copybn
000000000e3a2710 T mkl_blas_mc_zgemm_copybn_htn
000000000e3a2250 T mkl_blas_mc_zgemm_copybt
000000000e3a1dc0 T mkl_blas_mc_zgemm_copybt_htn
000000000e1ebb70 T mkl_blas_mc_zgemm_free_bufs
000000000e1eba40 T mkl_blas_mc_zgemm_get_blks_size
000000000e1eb940 T mkl_blas_mc_zgemm_get_blks_size_htn
000000000e1eb570 T mkl_blas_mc_zgemm_get_bufs
000000000b4505f0 T mkl_blas_mc_zgemm_get_bufs_size
000000000e1eb460 T mkl_blas_mc_zgemm_get_kernel
000000000e1eb450 T mkl_blas_mc_zgemm_get_kernel_version
000000000b4504d0 T mkl_blas_mc_zgemm_get_optimal_kernel
000000000b44f600 T mkl_blas_mc_zgemm_getbufs_bdz
000000000f89c050 T mkl_blas_mc_zgemm_htn_ker0_0_0
000000000f89b890 T mkl_blas_mc_zgemm_htn_ker0_0_1
000000000f903c30 T mkl_blas_mc_zgemm_htn_ker0_pst
000000000b44f640 T mkl_blas_mc_zgemm_initialize_buffers
000000000b44f630 T mkl_blas_mc_zgemm_initialize_kernel_info
000000000e1eb0d0 T mkl_blas_mc_zgemm_ker0
000000000f88ff40 T mkl_blas_mc_zgemm_ker0_full
000000000f8898f0 T mkl_blas_mc_zgemm_ker0_general
000000000e1eb290 T mkl_blas_mc_zgemm_ker0_htn
000000000f902c90 T mkl_blas_mc_zgemm_ker0_pst
000000000b44f610 T mkl_blas_mc_zgemm_kernel_0_bdz
000000000b44f580 T mkl_blas_mc_zgemm_map_thread_to_kernel
000000000b471610 T mkl_blas_mc_zgemm_mscale
000000000b4d3510 T mkl_blas_mc_zgemm_pst
000000000b4504c0 T mkl_blas_mc_zgemm_set_blks_size
000000000b4cf670 T mkl_blas_mc_zgemm_sm_01
000000000b4cf400 T mkl_blas_mc_zgemm_sm_01_10
000000000e39dc50 T mkl_blas_mc_zgemm_sm_02
000000000e398820 T mkl_blas_mc_zgemm_sm_03
000000000e391ef0 T mkl_blas_mc_zgemm_sm_04
000000000e389fd0 T mkl_blas_mc_zgemm_sm_05
000000000e380f60 T mkl_blas_mc_zgemm_sm_06
000000000e376720 T mkl_blas_mc_zgemm_sm_07
000000000e36aa30 T mkl_blas_mc_zgemm_sm_08
000000000e35d530 T mkl_blas_mc_zgemm_sm_09
000000000e34e8c0 T mkl_blas_mc_zgemm_sm_10
000000000b450400 T mkl_blas_mc_zgemm_zero_desc
000000000b4c6010 T mkl_blas_mc_zgemmt_nobufs
0000000007fc1510 T mkl_blas_xzgemm
0000000007fc1730 T mkl_blas_xzgemm_bdz
0000000007fc0fd0 T mkl_blas_xzgemm_internal_team
0000000007fc0d80 T mkl_blas_xzgemm_par
0000000007fc1300 T mkl_blas_xzgemmger
0000000007fc0b40 T mkl_blas_xzgemmt
0000000007dad2f0 T mkl_blas_zgemm
0000000007ee7c20 T mkl_blas_zgemm_1D_col
0000000007ee78f0 T mkl_blas_zgemm_1D_row
0000000007eeaee0 T mkl_blas_zgemm_1D_with_copy_0
0000000007ee8580 T mkl_blas_zgemm_2D_abcopy_abx_m_km_par_p
0000000007ee7f50 T mkl_blas_zgemm_2D_bcopy
0000000007ee6f60 T mkl_blas_zgemm_2D_bsrc
0000000007ee7280 T mkl_blas_zgemm_2D_improved_bsrc
0000000007eeb830 T mkl_blas_zgemm_2D_xgemm_p
0000000007fc00a0 T mkl_blas_zgemm_api_support
0000000007db4110 T mkl_blas_zgemm_batch
0000000007fbfec0 T mkl_blas_zgemm_blk_info_bdz
0000000007fbfd10 T mkl_blas_zgemm_get_bufs_size
0000000007fbfbf0 T mkl_blas_zgemm_get_optimal_kernel
0000000007fbfa70 T mkl_blas_zgemm_initialize_buffers
0000000007fbf8c0 T mkl_blas_zgemm_initialize_kernel_info
0000000007fbf780 T mkl_blas_zgemm_map_thread_to_kernel
0000000007fbf5f0 T mkl_blas_zgemm_mscale
0000000007eea070 T mkl_blas_zgemm_omp_driver_v1
0000000007ee9310 t mkl_blas_zgemm_omp_driver_v1._omp_fn.0
0000000007ee9470 t mkl_blas_zgemm_omp_driver_v1._omp_fn.1
0000000007ee9c40 t mkl_blas_zgemm_omp_driver_v1._omp_fn.2
0000000007ee9c30 T mkl_blas_zgemm_omp_free_prototype_memory
0000000007ee95d0 T mkl_blas_zgemm_omp_get_prototype
0000000007fbf470 T mkl_blas_zgemm_set_blks_size
0000000007eeb5e0 T mkl_blas_zgemm_xgemm_external_omp
0000000007fbf350 T mkl_blas_zgemm_zero_desc
0000000007dfdcd0 T mkl_blas_zgemmger
0000000007ed8c80 T mkl_blas_zgemmger_omp
0000000007ed8b40 t mkl_blas_zgemmger_omp._omp_fn.0
0000000007da3d70 T mkl_blas_zgemmt
0000000007ee66c0 T mkl_blas_zgemmt_omp_driver_v1
00000000080e0060 t mkl_lapack_zgemm_team
0000000007cf6970 T zgemm
0000000007cf6970 T zgemm_
0000000007cf7030 T zgemm_64
0000000007cf7030 T zgemm_64_
0000000007cf76a0 T zgemm_batch
0000000007cf76a0 T zgemm_batch_
0000000007cf7c00 T zgemm_batch_64
0000000007cf7c00 T zgemm_batch_64_

Solution

  • Broadly speaking, there are a few different ways of doing this at different levels of complexity. You can:

    1. Check whether libtorch links to intel-mkl dynamically, using ldd <some path to libtorch.*.so> or a similar command.
    2. Check that symbols for the particular intel-mkl routines you're interested in are present in libtorch, using a command like nm -a --demangle <some path to libtorch.*.so> | grep 'mkl'.
    3. Profile your specific call to torch::matmul() at runtime, using a tool like VTune, oprofile, or perf.

    (1) is a very basic sanity check. It tells you little beyond whether the specific libtorch library you're using will try to pull symbols in from intel-mkl at load time. It can help with debugging whether you're using the version of libtorch you expected, and whether that library can traverse the system's library path to find the version of intel-mkl it should. This is rarely an issue, but it is quick and easy to check, and you don't want to spend longer than necessary debugging what could turn out to be, for example, an incorrect installation of MKL. (2) will tell you a bit more about which MKL routines your copy of libtorch contains or references, although notably not which code paths cause them to be called at runtime. In the nm output in your edit, the zgemm and zgemv symbols are marked T or t, meaning they are defined inside libtorch_cpu.so itself, so that build appears to bundle its own copy of MKL rather than load a separate MKL shared library.
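As a rough illustration of how to read output like the nm listing in the question, the sketch below parses one line of default-format nm output and reports whether the symbol is code defined in the library (type T or t) or merely an undefined reference (type U). The names NmSymbol, parse_nm_line and is_defined_code_symbol are hypothetical helpers invented for this example, not part of any library, and the parser assumes the plain "address type name" layout shown above.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// One parsed line of nm output (hypothetical helper type for this sketch),
// e.g. "000000000b2c5290 T mkl_blas_avx2_xzgemv".
struct NmSymbol {
    char type = '?';   // nm type letter, e.g. 'T'/'t' (text section) or 'U' (undefined)
    std::string name;  // symbol name as printed by nm
};

// Parse a single line of default-format nm output.
// Defined symbols print "address type name"; undefined symbols print
// "type name" because their address column is blank.
inline bool parse_nm_line(const std::string& line, NmSymbol& out) {
    std::istringstream in(line);
    std::string first;
    if (!(in >> first)) return false;  // blank line
    std::string type_field = first;
    if (first.size() != 1) {           // first field was an address
        if (!(in >> type_field) || type_field.size() != 1) return false;
    }
    if (!std::isalpha(static_cast<unsigned char>(type_field[0]))) return false;
    std::string name;
    std::getline(in >> std::ws, name); // rest of the line is the symbol name
    if (name.empty()) return false;
    out.type = type_field[0];
    out.name = name;
    return true;
}

// True when the symbol's code lives inside the inspected library itself.
inline bool is_defined_code_symbol(const NmSymbol& s) {
    return s.type == 'T' || s.type == 't';
}
```

Feeding each line of the grep output through parse_nm_line and counting how many satisfy is_defined_code_symbol gives a quick summary: if the MKL entries are mostly T or t, the routines are compiled into the library; if they are U, the library expects another shared object to supply them.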

    (3) is almost certainly what you want, and will answer in much greater depth why those calls seem slow. I'm guessing that if you're using MKL, VTune might also be available (for AMD, consider the similar tool μProf). If you're on a system with Intel CPUs, VTune will also have detailed knowledge of those microarchitectures built in, provided that the version of VTune you're using was released after the processor you're running on (again, likely). A VTune callstack report can tell you which paths in your code call down into libtorch, which of those (if any) end up in particular MKL routines, and which of those routines take up a lot of time and why. VTune thread analysis may also tell you whether your OMP/MKL/torch threads have well-balanced workloads, whether other threads you're not thinking about are taking up time or causing yours to be descheduled, and so on. Similar analyses are possible with other tools, but may be more time-consuming to put together and visualize.
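Before setting up a full profiler, a crude timing comparison can already show whether extra threads help at all. The sketch below is a minimal, library-independent harness under stated assumptions: best_seconds, zgemm_gflops and speedup are hypothetical helper names, and the flop count uses the common convention of 8 real floating-point operations per complex multiply-add.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>

// Run `work` `reps` times and return the fastest wall-clock time in seconds.
// Taking the minimum reduces the influence of warm-up and scheduling noise.
template <typename F>
double best_seconds(F&& work, int reps) {
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < reps; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Achieved GFLOP/s for an (m x k) times (k x n) complex double product.
// Assumption: 8 real flops per complex multiply-add.
inline double zgemm_gflops(std::size_t m, std::size_t n, std::size_t k,
                           double seconds) {
    return 8.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k) / seconds / 1e9;
}

// Ratio of single-threaded time to multi-threaded time.
inline double speedup(double t_one_thread, double t_many_threads) {
    return t_one_thread / t_many_threads;
}
```

With libtorch, `work` could be a lambda that calls torch::matmul(a, b) on preallocated tensors, timed once after torch::set_num_threads(1) and once with the full thread count. A speedup close to 1 on large matrices would suggest the multiplication is effectively running on a single thread, which is then worth confirming with a profiler.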

    There are also trickier, more time-consuming things you can do to hook your own profiling and logging into assembled code like what's in MKL. Based on what's been stated so far, it's unlikely that this is what you need, but it is a possible option of last resort if existing tools are not doing the job. If you strongly believe you need techniques like this and that they will be worth the effort, resources such as "A study of Binary Instrumentation techniques" by Soumyakant Priyadarshan, or Intel's PIN dynamic binary instrumentation tool, may be good starting points.
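A much lighter form of logging than binary instrumentation is MKL's own verbose mode: launching the program with the environment variable MKL_VERBOSE=1 should make MKL print one line per BLAS call, including the routine name and the number of threads used. The sketch below extracts that thread count from such a line. It assumes the line layout described in Intel's documentation (a line starting with "MKL_VERBOSE", followed by the routine name, and ending with fields such as "NThr:64"); the exact format may differ between MKL versions, and parse_nthr and is_zgemm_call are hypothetical helper names for this example.

```cpp
#include <cstddef>
#include <string>

// Return the value following "NThr:" in one MKL verbose line,
// or -1 if the field is missing or has no digits.
inline int parse_nthr(const std::string& line) {
    const std::string key = "NThr:";
    const std::size_t pos = line.find(key);
    if (pos == std::string::npos) return -1;
    std::size_t i = pos + key.size();
    int value = 0;
    bool any_digit = false;
    while (i < line.size() && line[i] >= '0' && line[i] <= '9') {
        value = value * 10 + (line[i] - '0');
        any_digit = true;
        ++i;
    }
    return any_digit ? value : -1;
}

// True when the line reports a ZGEMM call (assumed prefix, see lead-in).
inline bool is_zgemm_call(const std::string& line) {
    return line.rfind("MKL_VERBOSE ZGEMM(", 0) == 0;
}
```

If ZGEMM lines appear at all while torch::matmul() runs, MKL is being called; if they report NThr:1, the multiplication is running single-threaded regardless of what the get_num_threads-style queries return.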