python, deep-learning, pytorch, classification

Low accuracy in binary classification with PyTorch


I'm practicing deep learning for binary classification with PyTorch, using the Breast-Cancer-Wisconsin-Diagnostic dataset.

I've tried different approaches, and the best result I can get is shown below; the accuracy is still low at 61%.

How can I improve the accuracy?

import pandas as pd

# base_dir is assumed to be defined elsewhere, pointing to the data folder
dataset = pd.read_excel(base_dir + "Breast-Cancer-Wisconsin-Diagnostic.xlsx")

number_of_columns = dataset.shape[1]

# encode the diagnosis labels as integer class codes
dataset['diagnosis'] = pd.Categorical(dataset['diagnosis']).codes

# shuffle, then split 70:30 (398 of the 569 rows) into training and test sets
dataset = dataset.sample(frac=1, random_state=1234)
train_input = dataset.values[:398, :number_of_columns-1]
train_target = dataset.values[:398, number_of_columns-1]
test_input = dataset.values[398:, :number_of_columns-1]
test_target = dataset.values[398:, number_of_columns-1]

import torch
torch.manual_seed(1234)
hidden_units = 5
net = torch.nn.Sequential(
    torch.nn.Linear(number_of_columns - 1, hidden_units),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_units, 2))

# choose optimizer and loss function
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)

# train
epochs = 50
for epoch in range(epochs):
    inputs = torch.tensor(train_input, dtype=torch.float32)
    targets = torch.tensor(train_target, dtype=torch.long)
    optimizer.zero_grad()
    out = net(inputs)
    loss = criterion(out, targets)
    loss.backward()
    optimizer.step()
    if epoch == 0 or (epoch + 1) % 10 == 0:
        print('Epoch %d Loss: %.4f' % (epoch + 1, loss.item()))

# Epoch 1 Loss: 412063.1250
# Epoch 10 Loss: 0.6628
# Epoch 20 Loss: 0.6639
# Epoch 30 Loss: 0.6592
# Epoch 40 Loss: 0.6587
# Epoch 50 Loss: 0.6588


# evaluate on the test set (gradients are not needed here)
with torch.no_grad():
    inputs = torch.tensor(test_input, dtype=torch.float32)
    targets = torch.tensor(test_target, dtype=torch.long)
    out = net(inputs)
    _, predicted = torch.max(out, 1)

correct = int((targets == predicted).sum())
error_count = test_target.size - correct

print('Errors: %d; Accuracy: %d%%' % (error_count, 100 * correct // test_target.size))

# Errors: 65; Accuracy: 61%

Solution

  • The features are on very different scales, so the first thing you should do is normalize the data (a minimal sketch follows this list).

  • Plot the loss and accuracy over the training epochs, for both the training set and the validation/test set, to see whether the model is overfitting or underfitting (see the tracking sketch below).

  • You can also try a more complex (deeper) model; one possible variant is sketched at the end. And since your training set has few samples, you can consider data augmentation and transfer learning as well, if possible.
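
A minimal sketch of the normalization step, assuming the train_input/test_input arrays from the question and scikit-learn's StandardScaler; the scaler is fit on the training split only, so no test-set statistics leak into training:

from sklearn.preprocessing import StandardScaler

# fit on the training split only, then reuse the same mean/std for the test split
scaler = StandardScaler()
train_input = scaler.fit_transform(train_input)  # zero mean, unit variance per feature
test_input = scaler.transform(test_input)

With standardized inputs, the first-epoch loss should start near ln(2) ≈ 0.69 instead of the 412063 seen above, and the same small network typically reaches a much higher test accuracy on this dataset.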
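
To see whether the model is overfitting or underfitting, record losses and accuracies for both splits each epoch. A sketch reusing net, criterion, optimizer, and epochs from the question (matplotlib is assumed to be available):

import matplotlib.pyplot as plt

train_x = torch.tensor(train_input, dtype=torch.float32)
train_y = torch.tensor(train_target, dtype=torch.long)
test_x = torch.tensor(test_input, dtype=torch.float32)
test_y = torch.tensor(test_target, dtype=torch.long)

history = {'train_loss': [], 'test_loss': [], 'train_acc': [], 'test_acc': []}
for epoch in range(epochs):
    optimizer.zero_grad()
    out = net(train_x)
    loss = criterion(out, train_y)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        test_out = net(test_x)
        history['train_loss'].append(loss.item())
        history['test_loss'].append(criterion(test_out, test_y).item())
        history['train_acc'].append((out.argmax(1) == train_y).float().mean().item())
        history['test_acc'].append((test_out.argmax(1) == test_y).float().mean().item())

# a widening gap between the two curves suggests overfitting;
# both curves flat at a high loss suggests underfitting
plt.plot(history['train_loss'], label='train loss')
plt.plot(history['test_loss'], label='test loss')
plt.xlabel('epoch')
plt.legend()
plt.show()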
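
And one possible deeper variant of the model; the layer sizes here are arbitrary illustrative choices, not tuned values:

net = torch.nn.Sequential(
    torch.nn.Linear(number_of_columns - 1, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2))

Note that on a dataset this small, extra depth rarely helps on its own; normalizing the inputs is the key fix here.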