RuntimeError during one-hot encoding

I have a dataset where class values go from -2 to 2 in steps of 1 (i.e., -2, -1, 0, 1, 2) and where 9 identifies the unlabelled data. When I one-hot encode the labels with

self._one_hot_encode(labels)

I get the following error: RuntimeError: index 1 is out of bounds for dimension 1 with size 1

due to

self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)

The error seems to be raised for [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1], which contains a 9; as I understand it, the mapping tries to set index 9 to 1. It is unclear to me how to fix it, even after going through past questions and answers on similar problems (e.g., index 1 is out of bounds for dimension 0 with size 1). The part of the code involved in the error is the following:

def _one_hot_encode(self, labels):
    # Get the number of classes
    classes = torch.unique(labels)
    classes = classes[classes != 9] # unlabelled 
    self.n_classes = classes.size(0)

    # One-hot encode labeled data instances and zero rows corresponding to unlabeled instances
    unlabeled_mask = (labels == 9)
    labels = labels.clone()  # defensive copying
    labels[unlabeled_mask] = 0
    self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
    self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
    self.one_hot_labels[unlabeled_mask, 0] = 0

    self.labeled_mask = ~unlabeled_mask

def fit(self, labels, max_iter, tol):
    
    self._one_hot_encode(labels)

    self.predictions = self.one_hot_labels.clone()
    prev_predictions = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)

    for i in range(max_iter):
        # Stop iterations if the system is considered at a steady state
        variation = torch.abs(self.predictions - prev_predictions).sum().item()
        

        prev_predictions = self.predictions
        self._propagate()

Example of dataset:

    ID  Target  Weight  Label  Score  Scale_Cat  Scale_num
0   A   D       65.1    1      87     Up         1
1   A   X       35.8    1      87     Up         1
2   B   C       34.7    1      37.5   Down       -2
3   B   P       33.4    1      37.5   Down       -2
4   C   B       33.1    1      37.5   Down       -2
5   S   X       21.4    0      12.5   NA         9

The source code I am using as reference is here: https://mybinder.org/v2/gh/thibaudmartinez/label-propagation/master?filepath=notebook.ipynb

Full track of the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-126-792a234f63dd> in <module>
      4 label_propagation = LabelPropagation(adj_matrix_t)
----> 6 label_propagation.fit(labels_t) # causing error
      7 label_propagation_output_labels = label_propagation.predict_classes()
      8 

<ipython-input-115-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
    100 
    101     def fit(self, labels, max_iter=1000, tol=1e-3):
--> 102         super().fit(labels, max_iter, tol)
    103 
    104 ## Label spreading

<ipython-input-115-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
     58             Convergence tolerance: threshold to consider the system at steady state.
     59         """
---> 60         self._one_hot_encode(labels)
     61 
     62         self.predictions = self.one_hot_labels.clone()

<ipython-input-115-54a7dbc30bd1> in _one_hot_encode(self, labels)
     42         labels[unlabeled_mask] = 0
     43         self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
---> 44         self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
     45         self.one_hot_labels[unlabeled_mask, 0] = 0
     46 

RuntimeError: index 1 is out of bounds for dimension 1 with size 1

Solution

I ran through your notebook (I think you changed the 9 to -1 to get things to run) and saw that the problem starts in this part of the code:

# Learn with Label Propagation
label_propagation = LabelPropagation(adj_matrix_t)
print("Label Propagation: ", end="")
label_propagation.fit(labels_t)
label_propagation_output_labels = label_propagation.predict_classes()

This eventually calls:

self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)

That call is where things go wrong.

Take a brief moment to read the PyTorch documentation on scatter (torch.Tensor.scatter_): to use scatter you need to understand the dim, index, src, and self tensors. For one-hot encoding with one row per observation and one column per class, we scatter along dim=1, and our src is simply the value 1 (we'll look a little more into this later). In the notebook you are calling scatter on dimension 1 with an index matrix of shape [40, 1] and a result (self) matrix of shape [40, 5].

I see two issues here:

1. You are using the literal class values (-2, -1, 0, 1, 2) as the indices in your index matrix. scatter treats each of these values as a column position in the self matrix, so any value outside the range 0 to n_classes - 1 is out of bounds. This is where the index-out-of-bounds error is coming from.
2. You mention that there are 6 classes (-2, -1, 0, 1, 2, and 9 for unlabelled), but you are one-hot encoding on 5 classes. (Yes, I know you want the unlabelled class to be all zeros, but that's a little difficult to achieve with scatter. I'll explain later.)
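
The first issue can be reproduced in isolation. The sketch below assumes a small hypothetical labels tensor that, like the list in the question, contains only the values 1 and 9; the variable names are illustrative and not taken from the notebook.

```python
import torch

# Hypothetical labels: only class 1 and the unlabelled marker 9 are present.
labels = torch.tensor([1, 1, 1, 9, 1])

# Same steps as _one_hot_encode in the question.
classes = torch.unique(labels)
classes = classes[classes != 9]
n_classes = classes.size(0)  # only one class is left, so n_classes == 1

labels = labels.clone()
labels[labels == 9] = 0
one_hot = torch.zeros((labels.size(0), n_classes), dtype=torch.float)

try:
    # The index value 1 needs a column 1, but one_hot only has column 0.
    one_hot.scatter(1, labels.unsqueeze(1), 1)
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

With a single class present, the result matrix has one column, so the raw label 1 cannot be used as a column index.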

So how do we fix this?

Issue 1: Let's start with a small example:

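import torch
src = torch.tensor([[1],[1],[1],[1],[1],[0]])  # assumed src (not given originally): 1 per row, 0 in the last row, matching the output below
index = torch.tensor([[5],[0],[3],[5],[1],[4]]); print(index.shape); print(index)
result = torch.zeros(6, 6, dtype=src.dtype).scatter_(1, index, src); print(result.shape); print(result)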

This will give us

torch.Size([6, 1])
tensor([[5],
        [0],
        [3],
        [5],
        [1],
        [4]])
torch.Size([6, 6])
tensor([[0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]])

The index matrix holds 6 observations with 1 observed value (category) each. The self matrix holds 6 observations, each with a 6-category one-hot vector. With scatter(dim=1), torch goes through each row (observation) and writes the value stored in src at that row into self at the same row, in the column given by the value stored in index. For a 3-D tensor the rule is:

self[i][index[i][j][k]][k] = src[i][j][k]

So in your case you were trying to write the value 1 into a row of self, which has shape [40, 1], at the column given by index[0] (which is equal to 1), and a matrix with a single column has no column 1. That gives the error in the question. When I ran your notebook the error was instead "index -1 is out of bounds for dimension 1 with size 5", but both have the same root cause.
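
To make the bounds check visible, the dim=1 rule can be mimicked in plain Python. This is only an illustrative model of the 2-D case (self[i][index[i][j]] = src[i][j]), not PyTorch's implementation, and scatter_dim1 is a hypothetical helper name.

```python
def scatter_dim1(target, index, src):
    """Plain-Python model of 2-D scatter along dim=1 (illustration only)."""
    n_cols = len(target[0])
    for i, row in enumerate(index):
        for j, col in enumerate(row):
            # Each index value must be a valid column position in target.
            if not 0 <= col < n_cols:
                raise IndexError(
                    f"index {col} is out of bounds for dimension 1 with size {n_cols}"
                )
            target[i][col] = src[i][j]
    return target


# Valid: every index value is a column position between 0 and 5.
index = [[5], [0], [3], [5], [1], [4]]
src = [[1]] * 6
result = scatter_dim1([[0] * 6 for _ in range(6)], index, src)

# Invalid: a one-column target cannot take index 1, as in the question.
try:
    scatter_dim1([[0] for _ in range(3)], [[1], [0], [1]], [[1]] * 3)
    message = None
except IndexError as err:
    message = str(err)

print(result)
print(message)
```

The same check rejects -1, which is why the notebook run and the question fail for the same reason.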

Issue 2: One-hot-encoding

It is just easier to do a complete one-hot encoding instead of one-hot with cold (all-zero) encodings in this case. The reason is that for one-hot with cold encodings you need to create a 0 value in your src matrix for every unlabelled observation, which is much more painful than just using 1 for src. Also, after reading this link: "Is it valid to have full zeros for OHE?", I think it makes more sense to use one-hot for every category.
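
For completeness, the cold variant can be sketched by passing a per-row src instead of the scalar 1: unlabelled rows get a 0 in src, so whichever column they point at stays zero. This is a sketch under assumptions: the raw labels are the values -2 to 2 plus 9 from the question, and raw, index, and src are illustrative names.

```python
import torch

# Hypothetical raw labels: classes -2..2, with 9 marking unlabelled rows.
raw = torch.tensor([-2, 9, 0, 2, 9, 1])
labeled_mask = raw != 9

# Shift labelled values to column indices 0..4; unlabelled rows point at column 0.
index = torch.where(labeled_mask, raw + 2, torch.zeros_like(raw)).unsqueeze(1)

# src holds 1.0 for labelled rows and 0.0 for unlabelled rows.
src = labeled_mask.to(torch.float).unsqueeze(1)

one_hot = torch.zeros(raw.size(0), 5).scatter_(1, index, src)
print(one_hot)
```

The extra bookkeeping is the mask-derived src; the index still has to be mapped into the valid column range first.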

So, for the second issue we simply need to map the categories to indices of the result/self matrix. Since we have 6 categories, we just need to map them to 0, 1, 2, 3, 4, 5. A simple lambda function does the trick. I used a random sampler to draw my data labels from a class list, as shown below (I randomly created 40 observations from 6 classes):

import random
import torch

classes = [-2, -1, 0, 1, 2, 9]

# Map each class to a column index: -2..2 -> 0..4, and 9 (unlabelled) -> 5
labels = []
for i in range(40):
    labels.append([(lambda x: x + 2 if x != 9 else 5)(random.choice(classes))])

index_aka_labels = torch.tensor(labels)
print(index_aka_labels)
print(index_aka_labels.shape)
print(torch.zeros(40, 6, dtype=torch.long).scatter_(1, index_aka_labels, 1))

Finally, we have achieved our desired result of OHE:

tensor([[0, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1, 0],
        ... (40 observations)
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1]])
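
Putting both fixes together, a standalone version of the encoding step might look like the sketch below. It is not the notebook's code: one_hot_encode and its unlabeled_value parameter are hypothetical, and it assumes the columns should follow the sorted class values that actually occur among the labelled rows. Unlike the six-column result above, it keeps unlabelled rows as all zeros, which is what the original _one_hot_encode was aiming for.

```python
import torch


def one_hot_encode(labels, unlabeled_value=9):
    """Return (one_hot, labeled_mask, classes); unlabelled rows stay all zeros."""
    labeled_mask = labels != unlabeled_value
    # classes holds the sorted class values; inverse gives, for each labelled
    # row, the position of its value in classes (0 .. n_classes - 1).
    classes, inverse = torch.unique(labels[labeled_mask], return_inverse=True)
    one_hot = torch.zeros(labels.size(0), classes.size(0))
    # Scatter only into the labelled rows, so unlabelled rows are never touched.
    one_hot[labeled_mask] = one_hot[labeled_mask].scatter(1, inverse.unsqueeze(1), 1.0)
    return one_hot, labeled_mask, classes


labels = torch.tensor([1, 1, 9, -2, 2, 0, -1])
one_hot, labeled_mask, classes = one_hot_encode(labels)
print(classes)
print(one_hot)
```

Because the column indices come from torch.unique rather than from the raw label values, negative labels and gaps in the label range no longer reach scatter.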