CUDA forces OpenMP to run in a single-threaded mode

I wrote a CUDA SGEMM program, and when I tried to compare its speed against a multi-threaded CPU implementation, the CPU code did not run on multiple threads. When I isolated the CPU implementation in a separate .cc file, built it, and ran it, there was no problem. The code in the .cu file and in the isolated .cc file is the same:

void sgemm_cpu_multi_threading(
    float* A, float* B, float* C, 
    float alpha, float beta, const int M, const int N, const int K
) {
    #pragma omp parallel for num_threads(8)
    for (int m = 0; m < M; m++) {
        printf("%d thread(s) can be used\n", omp_get_num_threads());
        for (int n = 0; n < N; n++) {
            float psum = 0.0;
            for (int k = 0; k < K; k++) {
                psum += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = C[m * N + n] * beta + psum * alpha;
        }
    }
}

int main() {
    omp_set_num_threads(OMP_THREADS);   // OMP_THREADS=8
    // ... 
    sgemm_cpu_multi_threading(A, B, C, alpha, beta, M, N, K);
    // ...
}

In the CUDA program, the printf always outputs "1 thread(s) can be used" and the execution is indeed serialized, whereas the isolated pure C++ executable reports "8 thread(s) can be used". My CMakeLists.txt for the CUDA build is shown here. For the isolated C++ project, I only deleted the lines related to CUDA.

cmake_minimum_required(VERSION 3.10)

project(SGEMM CUDA CXX)

if(NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE "Release")
endif()

SET(CMAKE_CXX_FLAGS_RELEASE "$ENV{CXXFLAGS} -O3 -Wall -fopenmp")
SET(COMPILE_CUDA True)

set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CUDA_ARCHITECTURES 75)

find_package(OpenMP REQUIRED)
if (OPENMP_FOUND)
  set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
  set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
endif()

include_directories(
    ${CMAKE_SOURCE_DIR}/../
)

add_executable(matmul ./matmul.cu)

target_link_libraries(matmul pthread OpenMP::OpenMP_CXX)

Can somebody tell me what's going on here? Why can't I properly use OpenMP multi-threading in a CUDA program? BTW, in case it matters, A, B, and C can be allocated with either new or cudaMallocHost. It makes no difference: the program still cannot run with more than one thread, even though nvcc should separate out the CPU function so that it is compiled for and run on the CPU.
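The printf inside the loop body prints one line per row of C, which gets noisy for large M. A minimal sketch of a quieter check follows, assuming a hypothetical helper named observed_team_size that is not part of the original program. It records the team size once from inside a parallel region. The _OPENMP guard is an addition so the snippet still builds when the OpenMP flag never reaches the compiler, in which case it simply reports 1.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not in the original code): returns how many threads
// actually execute a parallel region that requests 8 threads.
int observed_team_size() {
    int team = 1;  // value reported when OpenMP is not enabled for this file
#ifdef _OPENMP
    #pragma omp parallel num_threads(8)
    {
        // Only one thread of the team writes the result; the implicit
        // barrier at the end of "single" makes it visible afterwards.
        #pragma omp single
        team = omp_get_num_threads();
    }
#endif
    assert(team >= 1);  // a parallel region always has at least one thread
    return team;
}
```

Calling observed_team_size() once at the start of main and printing the result should be enough to tell a serialized build apart from a multi-threaded one. The runtime may hand out fewer than the 8 requested threads, so treat any value above 1 as "OpenMP is active".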

Solution

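As mentioned by @RobertCrovella, my code was not compiled properly because CMAKE_CUDA_FLAGS was not set. matmul.cu is compiled as a CUDA source, so the flags added to CMAKE_C_FLAGS and CMAKE_CXX_FLAGS are not applied to it, and without -fopenmp the host compiler ignores the omp pragma. The solution is quite straightforward; use: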

find_package(OpenMP REQUIRED)
if (OPENMP_FOUND)
  set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} ${OpenMP_CXX_FLAGS}")
endif()

instead of setting the C and CXX flags in CMakeLists.txt. I checked the content of ${OpenMP_CXX_FLAGS} via message() and it is -fopenmp (and nothing else), so there is no need to add -fopenmp again by hand. Also, I can compile the code without -Xcompiler and it runs correctly. Should the problem persist, one can try passing the OpenMP flag through -Xcompiler. Thanks to @RobertCrovella and @paleonix for their comments.
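One way to confirm that the flag really reaches the host compiler is to look at the _OPENMP macro, which an OpenMP-enabled compiler defines for the translation unit. The sketch below is only an illustration: kOpenMPVersionDate, openmp_enabled, and report_openmp_status are hypothetical names, and I am assuming the same host-side check can be dropped into the .cu file.

```cpp
#include <cassert>
#include <cstdio>

// _OPENMP is defined (as a yyyymm date) only when OpenMP support is enabled
// for the file being compiled; otherwise the omp pragmas are ignored.
#ifdef _OPENMP
constexpr long kOpenMPVersionDate = _OPENMP;
#else
constexpr long kOpenMPVersionDate = 0;  // the OpenMP flag was not applied
#endif

// Hypothetical helper: true when this file was compiled with OpenMP enabled.
constexpr bool openmp_enabled() { return kOpenMPVersionDate != 0; }

// Hypothetical helper: prints which of the two cases applies to this build.
void report_openmp_status() {
    if (openmp_enabled()) {
        std::printf("OpenMP enabled, _OPENMP = %ld\n", kOpenMPVersionDate);
    } else {
        std::printf("OpenMP NOT enabled for this file\n");
    }
}
```

If report_openmp_status() prints the "NOT enabled" line from the CUDA build, the compile flags for the .cu file are the place to look, not the OpenMP runtime.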