
PyTorch .to(torch.device("cuda")) slow when first called?


My question concerns the speed of the to method of PyTorch tensors and how it depends on the "execution state" (not sure if that's the correct term; feel free to edit).

My setup is as follows (RTX 2060 Super):

python version: 3.8.5 (default, Jul 28 2020, 12:59:40) [GCC 9.3.0]
pytorch version: 1.7.0+cu110

First, a minimal file that reproduces what I'm talking about:

import torch
import time
import sys

gpu = torch.device("cuda")

def ver():
    print("python version:", sys.version)
    print("pytorch version:", torch.__version__)
    print("\n")

def test():
    start = time.time()
    torch.cuda.init()
    print("cuda init:", time.time()-start) 
    x = torch.randn(15000,3).float()
    print("randn initialized:", time.time()-start)
    x.to(gpu)
    print("to(gpu):", time.time()-start)
    torch.cuda.synchronize()
    print("time after sync:", time.time()-start)
    print("\n")

if __name__ == "__main__":
    ver()
    test()
    test()

Running this yields the following console output:

python version: 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0]
pytorch version: 1.7.0+cu110


cuda init: 0.002934694290161133
randn initialized: 0.0033266544342041016
to(gpu): 1.5724568367004395
time after sync: 1.5725233554840088


cuda init: 9.5367431640625e-07
randn initialized: 0.00030875205993652344
to(gpu): 0.00037860870361328125
time after sync: 0.00039458274841308594

The second to call is much faster than the first one. Why does this happen? And more importantly, how can I achieve consistently fast transfers?
In a bigger project I'm working on, I should have enough time to initialize the GPU in parallel before to is called for the first time. Is that possible? If so, how? torch.cuda.init() doesn't seem to change the speed of the first to.

I cannot use torch.randn(x, y, device=gpu) because in the original setup the data comes from torch.from_numpy().

Thanks.


Solution

  • The first run is always the slowest because the first transfer has to set up the CUDA context and load everything onto the GPU; torch.cuda.init() alone does not appear to trigger all of that lazy setup. After the first run the times become much more consistent. If you run your test a few more times you should see the timings cluster closely together.