Tags: deep-learning, pytorch, regression, prediction

Deep learning predictions with similar values


Many of my deep-learning predictions come out with nearly the same value, which produces a horizontal line in the correlation plot.

I generated a small dataset that reproduces the problem (data), but my real dataset is much larger. That is why the layers are so big; I get the same problem even if I adapt them to the size of this simplified case.

If I predict the target values with another algorithm such as a random forest, I get an R of 0.4 with this small dataset. With the full dataset, if I run the deep-learning method and then remove all the values from the horizontal line, I get an R similar to that of the random forest. I don't understand why it does not predict the samples on the horizontal line the same way. Do you have any clue?
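
For reference, the random-forest baseline was along these lines (a minimal sketch using scikit-learn's RandomForestRegressor with default hyperparameters and the same 80/20 split as below; not necessarily the exact code I ran):

from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

# Same data and 80/20 split as in the deep-learning code below
data = pd.read_csv('data800.csv', index_col=0)
train = data.sample(frac=0.8, random_state=1)
test = data.drop(train.index)
y_train, y_test = train.pop('target'), test.pop('target')

rf = RandomForestRegressor(random_state=1)
rf.fit(train, y_train)
print('RF R:', np.corrcoef(y_test, rf.predict(test))[0, 1])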

Here is code that reproduces the problem, along with some correlation plots:

import torch, torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

var='target'
data = pd.read_csv('data800.csv', index_col=0)

train_dataset = data.sample(frac=0.8,random_state=1)
test_dataset = data.drop(train_dataset.index)

train_labels = train_dataset.pop(var)
test_labels = test_dataset.pop(var)

model = nn.Sequential(nn.Linear(train_dataset.shape[1], 1024), nn.ReLU(), nn.BatchNorm1d(1024),                   
                      nn.Linear(1024, 128), nn.ReLU(),  nn.BatchNorm1d(128),
                      nn.Linear(128, 64), nn.ReLU(),  nn.BatchNorm1d(64),
                      nn.Linear(64, 1))
optim = torch.optim.Adam(model.parameters(), 0.01)

for epoch in range(200):
    yhat = model(torch.tensor(train_dataset.values).to(torch.float32))
    loss = nn.MSELoss()(yhat.ravel(), torch.tensor(train_labels.values).to(torch.float32))
    optim.zero_grad()
    loss.backward()
    optim.step()
    # Track the test-set correlation during training
    with torch.no_grad():
        yhatt = model(torch.tensor(test_dataset.values).to(torch.float32)).numpy()
    score = np.corrcoef(test_labels, yhatt.reshape(test_labels.shape))
    if epoch % 20 == 0:
        print('epoch', epoch, '| loss:', loss.item(), '| R:', score[0, 1])

with torch.no_grad():
    yhat = model(torch.tensor(test_dataset.values).to(torch.float32)).numpy()
plt.scatter(test_labels, yhat)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.axis('equal')
plt.axis('square')
_ = plt.plot([-1000, 1000], [-1000, 1000])
plt.show()

[Correlation plot with the small dataset linked above: many predictions fall on a horizontal line]

[Correlation plot with the full dataset]


Solution

  • I think the issue was that each feature typically had a mixed distribution. ML algorithms generally work best when the features are symmetrically distributed and on a similar scale. I transformed each feature to a uniform distribution by replacing every value with its percentile rank, which flattens the distribution:

    [Histogram of a feature after the percentile transform: approximately uniform]
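
    As an aside, scikit-learn's QuantileTransformer implements essentially the same transform; here is a minimal sketch of the equivalent preprocessing (I used the manual percentile binning in the full code below, so this is an assumed shortcut, not what I ran):

    from sklearn.preprocessing import QuantileTransformer

    # Map each value to its percentile rank, fitted on the training data only.
    # n_quantiles must not exceed the number of training samples.
    qt = QuantileTransformer(n_quantiles=500, output_distribution='uniform')
    train_uniform = qt.fit_transform(train_dataset)  # values ~uniform in [0, 1]
    test_uniform = qt.transform(test_dataset)        # same mapping, no refitting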

    The model converged better after this. I also tweaked the architecture: it initially stepped up from ~50 features to 1024. I changed it to a tapered architecture that gradually scales down from the input feature size, which also improved the results. Final train RMSE was 0.14 and test-set r = 0.42. Code below.

    [Correlation plot of predictions vs. true values after the changes]

    import torch, torch.nn as nn
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    
    var = 'target'
    data = pd.read_csv('data800.csv', index_col=0)
    
    train_dataset = data.sample(frac=0.8, random_state=1)
    test_dataset = data.drop(train_dataset.index)
    
    train_labels = train_dataset.pop(var)
    test_labels = test_dataset.pop(var)
    
    #Flatten distribution by replacing each value with its percentile
    train_dataset_transformed = train_dataset.copy()
    test_dataset_transformed = test_dataset.copy()
    for feature in train_dataset.columns:
        #Percentiles estimated from train data
        bin_res = 0.2
        eval_percentiles = np.arange(bin_res, 100, bin_res)
        percentiles = [
            np.percentile(train_dataset[feature], p)
            for p in eval_percentiles
        ]
    
        #Apply to both train and test data
        train_dataset_transformed[feature] = pd.cut(
            train_dataset[feature],
            bins=[-np.inf] + percentiles + [np.inf],
            labels=False
        ).astype(np.float32)
        
        test_dataset_transformed[feature] = pd.cut(
            test_dataset[feature],
            bins=[-np.inf] + percentiles + [np.inf],
            labels=False
        ).astype(np.float32)
    
    #Hist before and after:
    # plt.hist(train_dataset.iloc[:, 0])
    # plt.hist(train_dataset_transformed.iloc[:, 0], bins=100)
    n_feat = train_dataset.shape[1]
    
    model = nn.Sequential(
        nn.Linear(n_feat, n_feat), nn.ReLU(), nn.BatchNorm1d(n_feat),                   
        nn.Linear(n_feat, n_feat // 2), nn.ReLU(), nn.BatchNorm1d(n_feat // 2),                   
        # nn.Linear(n_feat // 2, n_feat // 2), nn.ReLU(),  nn.BatchNorm1d(n_feat // 2),
        nn.Linear(n_feat // 2, n_feat // 4), nn.ReLU(),  nn.BatchNorm1d(n_feat // 4),
        # nn.Linear(n_feat // 4, n_feat // 4), nn.ReLU(),  nn.BatchNorm1d(n_feat // 4),
        nn.Linear(n_feat // 4, 1)
    )
    
    optim = torch.optim.Adam(model.parameters(), 0.01)
    
    #Scale
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler().fit(train_dataset_transformed)
    
    X_train = scaler.transform(train_dataset_transformed)
    X_test = scaler.transform(test_dataset_transformed)
    
    #Convert to tensors
    X_train = torch.tensor(X_train).float()
    y_train = torch.tensor(train_labels.values).float()
    
    X_test = torch.tensor(X_test).float()
    y_test = torch.tensor(test_labels.values).float()
    
    torch.manual_seed(0)
    for epoch in range(1770):
        yhat = model(X_train)
    
        loss = nn.MSELoss()(yhat.ravel(), y_train)
        optim.zero_grad()
        loss.backward()
        optim.step()
    
        with torch.no_grad():
            yhatt = model(X_test)
            score = np.corrcoef(y_test, yhatt.ravel())
            if epoch % 30 == 0:
                print('epoch', epoch, '| loss:', loss.item(), '| R:', score[0, 1])
    
    with torch.no_grad():
        yhat = model(X_test).numpy()
    plt.scatter(test_labels, yhat)
    ax_lims = plt.gca().axis()
    plt.plot([0, 100], [0, 100], 'k:', label='y=x')
    plt.gca().axis(ax_lims)
    plt.xlabel('True Values')
    plt.ylabel('Predictions')
    plt.legend()
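
    The final numbers quoted above can be checked along these lines (a sketch; calling model.eval() is my addition here so the BatchNorm layers use their running statistics at inference, which may shift the values slightly):

    model.eval()  # BatchNorm: switch from batch statistics to running statistics
    with torch.no_grad():
        train_rmse = nn.MSELoss()(model(X_train).ravel(), y_train).sqrt().item()
        test_r = np.corrcoef(y_test.numpy(), model(X_test).ravel().numpy())[0, 1]
    print(f'train RMSE: {train_rmse:.2f} | test r: {test_r:.2f}')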