Tags: tensorflow, keras, pytorch, tensorflow2.0

Why is PyTorch significantly slower than TensorFlow? Both are running CUDA


I was testing the speed of TensorFlow 2.0 against PyTorch 2.0 (since I'm just starting to learn PyTorch). With the exact same model architecture, the same batch size, the same optimizer, and both running on CUDA, the PyTorch version takes about three times as long (1 minute for TF vs. 3 minutes for PyTorch) and also reaches a worse validation accuracy (83% for TF vs. 78% for PyTorch).

Here is the TensorFlow code:

import tensorflow as tf

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer=tf.keras.optimizers.SGD(1e-3), loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=20, batch_size=64, validation_data=(test_images, test_labels))

And here is the PyTorch code:

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

device = torch.device("cuda")

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()
model.to(device)

learning_rate = 1e-3
batch_size = 64
epochs = 5
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)

    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

epochs = 20
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")

What I've noticed is that TensorFlow keeps CUDA utilization around 60% and uses all of my dedicated GPU memory, while PyTorch bounces between 0% and 30% utilization and barely uses any GPU memory at all.
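
(As a quick sanity check of the memory side from within PyTorch, continuing from the code above, something like this shows how little memory the PyTorch run actually holds:)

# How much GPU memory PyTorch has allocated/reserved on the current device.
print(f"allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e6:.1f} MB")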

This post exists, but that slowdown was caused by CUDA_LAUNCH_BLOCKING, which is not set anywhere in my code.
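
(A minimal check that the variable really isn't set in my environment:)

import os

# Prints None if CUDA_LAUNCH_BLOCKING is not set for this process.
print(os.environ.get("CUDA_LAUNCH_BLOCKING"))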

Specs:

  • RTX 3060
  • Intel i7-10700, overclocked to ~4.2 GHz
  • 64GB memory

Edit: I've tried increasing num_workers and enabling pin_memory for the DataLoaders, but that only saved around 15 seconds over 20 epochs, and torch.backends.cudnn.benchmark = True doesn't help at all. A sketch of what I tried is below.
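
(Roughly what that edit looked like, continuing from the code above; num_workers=4 is just an example value:)

train_dataloader = DataLoader(training_data, batch_size=64, num_workers=4, pin_memory=True)
test_dataloader = DataLoader(test_data, batch_size=64, num_workers=4, pin_memory=True)

# Lets cuDNN benchmark and cache the fastest kernels; this mostly helps convolutional
# models, so it is not surprising that it does nothing for this fully connected network.
torch.backends.cudnn.benchmark = True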


Solution

  • While I was running the test to optimize num_workers outlined in this medium article, I figured out that it was my DataLoader that was slowing me down. After looking through the documentation, I came across persistent_workers=True, which solved the problem for me (see the sketch below).

    The training time went from ~3 minutes to under 25 seconds.
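
    For anyone hitting the same thing, this is roughly the change (a sketch, continuing from the question's code; num_workers=4 is only an example value, tune it with the test from the article):

    # Re-create the DataLoaders with worker processes that are kept alive between
    # epochs instead of being spawned from scratch every epoch.
    train_dataloader = DataLoader(
        training_data,
        batch_size=64,
        num_workers=4,            # example value; tune with the num_workers test above
        pin_memory=True,
        persistent_workers=True,  # only has an effect when num_workers > 0
    )
    test_dataloader = DataLoader(
        test_data,
        batch_size=64,
        num_workers=4,
        pin_memory=True,
        persistent_workers=True,
    )

    persistent_workers=True keeps the worker processes alive across epochs, so the per-epoch worker startup cost disappears.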