I've written a matrix multiplication kernel in SYCL, based on tiling sub-matrices into local memory. With tiling (tile size 16x16) I get up to a 2x speedup over the naive (untiled) approach.
For smaller tile sizes I get close to naive speeds, which is expected. For any tile size above 16 (I stick to powers of 2 because my matrix size is a power of 2), such as 32, the kernel throws a SYCL exception.
I suspect this is because the GPU cannot accommodate the larger tiles in its local memory.
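For reference, the general shape of such a tiled kernel is sketched below. This is an illustrative sketch only, not my actual kernel; N, TILE, the buffer layout, and the name matmul_tiled are all assumptions:

#include <CL/sycl.hpp>
using namespace sycl;

constexpr size_t N = 1024;   // assumed square matrix dimension (a power of 2)
constexpr size_t TILE = 16;  // tile size; must divide N

// Illustrative tiled matmul: C = A * B, staging TILE x TILE blocks in local memory
void matmul_tiled(queue &q, buffer<float, 2> &bufA,
                  buffer<float, 2> &bufB, buffer<float, 2> &bufC) {
    q.submit([&](handler &h) {
        auto a = bufA.get_access<access::mode::read>(h);
        auto b = bufB.get_access<access::mode::read>(h);
        auto c = bufC.get_access<access::mode::write>(h);
        // Two TILE x TILE scratch tiles in work-group local memory
        accessor<float, 2, access::mode::read_write, access::target::local>
            tA(range<2>(TILE, TILE), h), tB(range<2>(TILE, TILE), h);

        h.parallel_for<class MatMulTiled>(
            nd_range<2>(range<2>(N, N), range<2>(TILE, TILE)),
            [=](nd_item<2> it) {
                size_t row = it.get_global_id(0), col = it.get_global_id(1);
                size_t li = it.get_local_id(0), lj = it.get_local_id(1);
                float sum = 0.0f;
                for (size_t t = 0; t < N / TILE; ++t) {
                    // Each work-item stages one element of each sub-matrix
                    tA[li][lj] = a[row][t * TILE + lj];
                    tB[li][lj] = b[t * TILE + li][col];
                    it.barrier(access::fence_space::local_space);
                    for (size_t k = 0; k < TILE; ++k)
                        sum += tA[li][k] * tB[k][lj];
                    it.barrier(access::fence_space::local_space);
                }
                c[row][col] = sum;
            });
    });
}

Each work-group stages one TILE x TILE block of A and of B, so the local-memory footprint is 2 * TILE * TILE * sizeof(float) bytes, and the work-group holds TILE * TILE work-items.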
Questions: how can I query the GPU's local memory size (and related limits such as the maximum work-group size) so I can pick a safe tile size? I tried checking ark.intel.com, but it doesn't list the GPU's local cache size. Current setup: i7-8665U with Intel UHD 620.
P.S.: If you would like to see my kernel code, please leave a comment and I will add it. For now I don't feel the need to include it and bloat the post.
@Artyom has given an explanation of the things to take care of when implementing matrix multiplication on a GPU.
As for the questions, here are the SYCL snippets that show what I was looking for:
#include <CL/sycl.hpp>
#include <iostream>
#include "dpc_common.hpp" // ships with the oneAPI samples; provides exception_handler
using namespace sycl;

int main() {
    // Create a queue on the default device, with an async exception handler
    default_selector d_selector;
    queue q(d_selector, dpc_common::exception_handler);
    std::cout << "Enumerated Device: "
              << q.get_device().get_info<info::device::name>() << "\n";

    // Query the device limits relevant to choosing a tile size
    auto wgroup_size = q.get_device().get_info<info::device::max_work_group_size>();
    auto local_mem_size = q.get_device().get_info<info::device::local_mem_size>();
    auto global_mem_size = q.get_device().get_info<info::device::global_mem_size>();
    std::cout << "Maximum workgroup size\t:" << wgroup_size << "\n"
              << "Global Memory Size\t:" << global_mem_size / 1024 / 1024 << " MB\n"
              << "Local Memory Size\t:" << local_mem_size / 1024 << " KB\n";
    return 0;
}
This shows:
Enumerated Device: Intel(R) Gen9
Maximum workgroup size :256
Global Memory Size :3199 MB
Local Memory Size :64 KB
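Putting these numbers together suggests the failure at tile size 32 is not local memory at all: a 32x32 work-group needs 1024 work-items, which exceeds the maximum work-group size of 256, while two 32x32 float tiles would only take 2 * 32 * 32 * 4 B = 8 KB of the 64 KB local memory. A small hypothetical helper (the name and the two-float-tile assumption are mine, matching the usual tiled kernel) that picks the largest power-of-two tile satisfying both limits:

#include <cstddef>

// Hypothetical helper, not part of my actual code: largest power-of-two tile
// that fits both the work-group size limit and the local memory limit,
// assuming the kernel stages two TILE x TILE float tiles in local memory.
std::size_t max_square_tile(std::size_t wgroup_size, std::size_t local_mem_size) {
    std::size_t tile = 1;
    while ((2 * tile) * (2 * tile) <= wgroup_size &&                        // work-items per group
           2 * (2 * tile) * (2 * tile) * sizeof(float) <= local_mem_size)   // local memory footprint
        tile *= 2;
    return tile;
}

// With wgroup_size = 256 and local_mem_size = 64 KB this returns 16:
// the work-group size limit, not local memory, is what rules out 32x32 tiles here.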