I've written a matrix multiplication kernel in SYCL, based on tiling sub-matrices into local memory. With tiling (tile size 16x16) I get up to a 2x speedup over the naive (untiled) approach.
For smaller tile sizes I get close to naive speeds, which is expected. For any tile size above 16 (I stick to powers of 2 because my matrix size is a power of 2), such as 32, the kernel throws a SYCL exception.
I suspect this is because the GPU cannot accommodate the larger tiles in its local memory.
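For reference, the general shape of such a tiled kernel is sketched below. This is an illustrative sketch only, not my actual kernel; N, TILE, the buffer layout, and the name matmul_tiled are all assumptions:

#include <CL/sycl.hpp>
using namespace sycl;

constexpr size_t N = 1024;   // assumed square matrix dimension (a power of 2)
constexpr size_t TILE = 16;  // tile size; must divide N

// Illustrative tiled matmul: C = A * B, staging TILE x TILE blocks in local memory
void matmul_tiled(queue &q, buffer<float, 2> &bufA,
                  buffer<float, 2> &bufB, buffer<float, 2> &bufC) {
    q.submit([&](handler &h) {
        auto a = bufA.get_access<access::mode::read>(h);
        auto b = bufB.get_access<access::mode::read>(h);
        auto c = bufC.get_access<access::mode::write>(h);
        // Two TILE x TILE scratch tiles in work-group local memory
        accessor<float, 2, access::mode::read_write, access::target::local>
            tA(range<2>(TILE, TILE), h), tB(range<2>(TILE, TILE), h);

        h.parallel_for<class MatMulTiled>(
            nd_range<2>(range<2>(N, N), range<2>(TILE, TILE)),
            [=](nd_item<2> it) {
                size_t row = it.get_global_id(0), col = it.get_global_id(1);
                size_t li = it.get_local_id(0), lj = it.get_local_id(1);
                float sum = 0.0f;
                for (size_t t = 0; t < N / TILE; ++t) {
                    // Each work-item stages one element of each sub-matrix
                    tA[li][lj] = a[row][t * TILE + lj];
                    tB[li][lj] = b[t * TILE + li][col];
                    it.barrier(access::fence_space::local_space);
                    for (size_t k = 0; k < TILE; ++k)
                        sum += tA[li][k] * tB[k][lj];
                    it.barrier(access::fence_space::local_space);
                }
                c[row][col] = sum;
            });
    });
}

Each work-group stages one TILE x TILE block of A and of B, so the local-memory footprint is 2 * TILE * TILE * sizeof(float) bytes, and the work-group holds TILE * TILE work-items.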
Questions: how can I query the GPU's local memory size (and related limits such as the maximum work-group size) so I can pick a safe tile size? I tried checking ark.intel.com, but it doesn't list the GPU's local cache size. Current setup: i7-8665U with Intel UHD 620.
P.S.: If you would like to see my kernel code, please leave a comment and I will add it. For now I don't feel the need to include it and bloat the post.
@Artyom has given an explanation of the things to take care of when implementing matrix multiplication on a GPU.
As for the questions, here are the SYCL snippets that show what I was looking for:
#include <CL/sycl.hpp>
#include <iostream>
#include "dpc_common.hpp" // ships with the oneAPI samples; provides exception_handler
using namespace sycl;

int main() {
    // Create a queue on the default device, with an async exception handler
    default_selector d_selector;
    queue q(d_selector, dpc_common::exception_handler);
    std::cout << "Enumerated Device: "
              << q.get_device().get_info<info::device::name>() << "\n";

    // Query the device limits relevant to choosing a tile size
    auto wgroup_size = q.get_device().get_info<info::device::max_work_group_size>();
    auto local_mem_size = q.get_device().get_info<info::device::local_mem_size>();
    auto global_mem_size = q.get_device().get_info<info::device::global_mem_size>();
    std::cout << "Maximum workgroup size\t:" << wgroup_size << "\n"
              << "Global Memory Size\t:" << global_mem_size / 1024 / 1024 << " MB\n"
              << "Local Memory Size\t:" << local_mem_size / 1024 << " KB\n";
    return 0;
}
This shows:
Enumerated Device: Intel(R) Gen9
Maximum workgroup size :256
Global Memory Size :3199 MB
Local Memory Size :64 KB
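Putting these numbers together suggests the failure at tile size 32 is not local memory at all: a 32x32 work-group needs 1024 work-items, which exceeds the maximum work-group size of 256, while two 32x32 float tiles would only take 2 * 32 * 32 * 4 B = 8 KB of the 64 KB local memory. A small hypothetical helper (the name and the two-float-tile assumption are mine, matching the usual tiled kernel) that picks the largest power-of-two tile satisfying both limits:

#include <cstddef>

// Hypothetical helper, not part of my actual code: largest power-of-two tile
// that fits both the work-group size limit and the local memory limit,
// assuming the kernel stages two TILE x TILE float tiles in local memory.
std::size_t max_square_tile(std::size_t wgroup_size, std::size_t local_mem_size) {
    std::size_t tile = 1;
    while ((2 * tile) * (2 * tile) <= wgroup_size &&                        // work-items per group
           2 * (2 * tile) * (2 * tile) * sizeof(float) <= local_mem_size)   // local memory footprint
        tile *= 2;
    return tile;
}

// With wgroup_size = 256 and local_mem_size = 64 KB this returns 16:
// the work-group size limit, not local memory, is what rules out 32x32 tiles here.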