Can CUDA cores run things absolutely parallel or do they need context switching?

Can a CUDA INT32 Core process two different integer instructions completelly parallel, without context switching? I know that it is not possible on a CPU, but on a NVIDIA GPU? I know that a SM can run warps, and if core has to wait for some information, then a it gets another thread from the dispatch unit.

Solution

I know that it is not possible on a CPU, but on a NVIDIA GPU?

This assertion is wrong on modern mainstream CPUs (eg. since at least a decade for nearly all x86-64 processors, starting from Intel Skylake or AMD Zen 2). Indeed, modern x86-64 Intel/AMD processor can generally compute 2 (256 AVX) SIMD vectors in parallel since there is generally 2 SIMD units. Processors like Intel Skylake also have 4 ALU units capable of computing 4 basic arithmetic operations (eg. add, sub, and, xor) in parallel per cycle. Some instruction like division are far more expensive and do not run in parallel on such architecture though it is well pipelined. The instructions can come from the same thread on the same logical cores or possibly 2 threads (of possibly 2 different processes) scheduled on 2 logical cores without any context switches. Note that recent high-end ARM processors can also do this (even some mobile processors).

Can a CUDA INT32 Core process two different integer instructions completelly parallel, without context switching?

NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Thus, 1 instruction operate on 32 items in parallel (though, theoretically, an hardware can be free not to do that completely in parallel). A kernel execution basically contains many block and blocks are scheduled to SM. An SM can operate on many blocks concurrently so there is a massive amount of parallelism available.

Whether a specific GPU can execute two INT32 warp in parallel it is dependent of the target architecture, not CUDA itself. On modern Nvidia GPUs, each SM can be split in multiple partitions that can each execute instructions on blocks independently of the other partitions. For example, AFAIK, on a Pascal GP104, there is 20 SM and each SM has 4 partition capable of running SIMD instructions operating on 1 warp (32 items) at time. In practice, things can be a bit more complex on newer architectures. You can get more information here.