
How to interpose the CUDA runtime API used within TensorFlow?


I have installed TensorFlow (not from source) on an x86-64 Ubuntu machine with CUDA 10 and a suitable GPU. My goal is to intercept (using LD_PRELOAD) the CUDA runtime API calls made by any TensorFlow application. Unfortunately for my use case, I am not able to build TensorFlow from source for my target machine, which is not x86-64.

I am able to intercept the cudaLaunchKernel calls made by a C++ test program that dynamically links against the CUDA runtime API, and on first inspection I assumed Python would load the same CUDA .so and be interceptable in the same way. I am confused because LD_PRELOAD has no effect on a normally-installed TensorFlow application running with CUDA enabled.
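For context, a minimal shim of the kind I am describing looks roughly like this (a sketch only; the file name and build command are illustrative, and it assumes glibc's RTLD_NEXT):

```cpp
// interpose.cpp -- hypothetical file name for the shim described above.
// Build sketch: g++ -shared -fPIC interpose.cpp -o libinterpose.so \
//               -I/usr/local/cuda/include -ldl
// Use:          LD_PRELOAD=./libinterpose.so ./my_cuda_test
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for RTLD_NEXT
#endif
#include <cstdio>
#include <dlfcn.h>
#include <cuda_runtime.h>

// Same signature as the real Runtime API entry point, so the dynamic
// linker resolves calls here first when this library is preloaded.
extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
                                        dim3 blockDim, void **args,
                                        size_t sharedMem, cudaStream_t stream) {
  using real_fn = cudaError_t (*)(const void *, dim3, dim3, void **,
                                  size_t, cudaStream_t);
  // Look up the next (real) definition of the symbol, i.e. libcudart's.
  static real_fn real =
      reinterpret_cast<real_fn>(dlsym(RTLD_NEXT, "cudaLaunchKernel"));
  std::fprintf(stderr,
               "[interpose] cudaLaunchKernel(func=%p, grid=(%u,%u,%u), "
               "block=(%u,%u,%u))\n",
               func, gridDim.x, gridDim.y, gridDim.z,
               blockDim.x, blockDim.y, blockDim.z);
  return real(func, gridDim, blockDim, args, sharedMem, stream);
}
```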

I expect that cudaLaunchKernel calls within TensorFlow should be intercepted by my LD_PRELOAD'd interposition library. Is this correct?


Solution

  • It appears that TensorFlow wrote stream_executor to avoid depending on CUDA's Runtime API, instead wrapping CUDA's Driver API (cuLaunchKernel) itself with open-source code. This is discussed in a Pull Request to TensorFlow that would have allowed interposing CUDA's Runtime API, which was rejected. Likewise, in the TF source (see here), we can see that the cu* Driver API is actively used in place of the Runtime API, which explains why a Runtime API shim never fires.
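  • Since the Driver API is what TensorFlow actually calls, one option is to interpose cuLaunchKernel instead. The sketch below is hypothetical (file names are illustrative) and only works if the symbol is resolved through ordinary dynamic linking; if stream_executor loads libcuda.so.1 via dlopen()/dlsym(), a preloaded symbol will not be consulted, and you would have to interpose dlopen/dlsym themselves.

```cpp
// driver_interpose.cpp -- hypothetical shim for the Driver API entry point.
// Build sketch: g++ -shared -fPIC driver_interpose.cpp -o libdriverhook.so \
//               -I/usr/local/cuda/include -ldl
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for RTLD_NEXT
#endif
#include <cstdio>
#include <dlfcn.h>
#include <cuda.h>

extern "C" CUresult cuLaunchKernel(CUfunction f, unsigned gridDimX,
                                   unsigned gridDimY, unsigned gridDimZ,
                                   unsigned blockDimX, unsigned blockDimY,
                                   unsigned blockDimZ, unsigned sharedMemBytes,
                                   CUstream hStream, void **kernelParams,
                                   void **extra) {
  using real_fn = CUresult (*)(CUfunction, unsigned, unsigned, unsigned,
                               unsigned, unsigned, unsigned, unsigned,
                               CUstream, void **, void **);
  // Forward to the real Driver API implementation in libcuda.
  static real_fn real =
      reinterpret_cast<real_fn>(dlsym(RTLD_NEXT, "cuLaunchKernel"));
  std::fprintf(stderr, "[interpose] cuLaunchKernel(f=%p)\n",
               static_cast<void *>(f));
  return real(f, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY,
              blockDimZ, sharedMemBytes, hStream, kernelParams, extra);
}
```

Note the caveat above: dlsym() on an explicitly dlopen()'d handle bypasses LD_PRELOAD entirely, so whether this shim sees TensorFlow's launches depends on how the process resolves the Driver API symbols.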