PyTorch: How to get 2D data into a DataLoader?

I have a data set like this:

edge_origins = np.array([[0,1,2,3,4],[6,7,8]])
edge_destinations = np.array([[1,2,3,4,5],[7,8,9]])
target = np.array([0,1])
x = [[np.array([0.1,0.5,0.2]),np.array([0.5,0.6,0.23]),

This is a list of two networks. The first has 6 nodes, 5 edges, and class 0; the second has 4 nodes, 3 edges, and class 1.

I want to develop a model in PyTorch that will classify each network into its class, and then I'll give it a new set of networks to classify.

So ultimately, I want to be able to shuffle these lists (simultaneously, i.e. maintaining the order between the data and the classes), split into train and test, and then read the train and test data into two data loaders, and feed these into a PyTorch network.

I wrote this:

edge_origins = np.array([[0,1,2,3,4],[6,7,8]])
edge_destinations = np.array([[1,2,3,4,5],[7,8,9]])
target = np.array([0,1])
x = [[np.array([0.1,0.5,0.2]),np.array([0.5,0.6,0.23]),

edge_index = torch.tensor([edge_origins, edge_destinations], dtype=torch.long)
dataset = Data(x=x, edge_index=edge_index, y=y, num_classes = len(set(target)))

And the error is:

    edge_index = torch.tensor([edge_origins, edge_destinations], dtype=torch.long)
ValueError: expected sequence of length 5 at dim 2 (got 3)

But then once that is fixed I think the next step is:

dataset = dataset.shuffle()

train_dataset = dataset[:1] #for toy example
test_dataset = dataset[1:]

print(f'Number of training graphs: {len(train_dataset)}')
print(f'Number of test graphs: {len(test_dataset)}')

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(dataset.num_node_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GCNConv(hidden_channels, hidden_channels)
        self.lin = Linear(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index, batch):
        # 1. Obtain node embeddings 
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = self.conv3(x, edge_index)

        # 2. Readout layer
        x = global_mean_pool(x, batch)  # [batch_size, hidden_channels]

        # 3. Apply a final classifier
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin(x)
        return x

model = GCN(hidden_channels=64)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

def train():

    for data in train_loader:  # Iterate in batches over the training dataset.
         out = model(data.x, data.edge_index, data.batch)  # Perform a single forward pass.
         loss = criterion(out, data.y)  # Compute the loss.
         loss.backward()  # Derive gradients.
         optimizer.step()  # Update parameters based on gradients.
         optimizer.zero_grad()  # Clear gradients.

def test(loader):

     correct = 0
     for data in loader:  # Iterate in batches over the training/test dataset.
         out = model(data.x, data.edge_index, data.batch)  
         pred = out.argmax(dim=1)  # Use the class with highest probability.
         correct += int((pred == data.y).sum())  # Check against ground-truth labels.
     return correct / len(loader.dataset)  # Derive ratio of correct predictions.

for epoch in range(1, 171):
    train()  # Run one training pass before measuring accuracy.
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    print(f'Epoch: {epoch:03d}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}')

Could someone demonstrate how to get my data into the PyTorch network above?


  • In PyTorch Geometric, a Data object holds only one graph, so you need one Data object per network. You could iterate through your arrays like so:

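    data_list = []
    for i in range(2):
        # One Data object per graph; edge_index has shape [2, num_edges].
        edge_index_curr = torch.tensor([edge_origins[i], edge_destinations[i]], dtype=torch.long)
        data = Data(x=torch.tensor(x[i]), edge_index=edge_index_curr, y=torch.tensor(target[i]))
        data_list.append(data)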

    You can then pass this list of Data objects to PyTorch Geometric's DataLoader:

    loader = DataLoader(data_list, batch_size=32)

    If you need to split into train/val/test sets (I would advise having more than 2 samples for that), you can do it manually or with sklearn.model_selection.

    For data augmentation, if you really do have very little data, PyTorch Geometric comes with transforms.
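
A note on the loop above, as a minimal sketch rather than a definitive recipe: the second network's edges use node ids 6 to 9, but a per-graph edge_index is expected to index the rows of that graph's own x, i.e. to run from 0 to num_nodes - 1. The plain-Python sketch below relabels each graph's node ids and then does the manual shuffle-and-split mentioned above. The helper names relabel_edges and shuffle_and_split and the dict layout are hypothetical, chosen only for illustration. It assumes every node appears in at least one edge and that the rows of x are ordered by ascending node id.

```python
import random


def relabel_edges(origins, destinations):
    """Map arbitrary node ids to 0..n-1 (assumes every node touches an edge)."""
    node_ids = sorted(set(origins) | set(destinations))
    index_of = {node: i for i, node in enumerate(node_ids)}
    src = [index_of[n] for n in origins]
    dst = [index_of[n] for n in destinations]
    return src, dst, len(node_ids)


def shuffle_and_split(items, test_fraction=0.5, seed=0):
    """Shuffle a copy once, then cut it into (train, test)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = max(1, round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


# Toy data from the question, kept as plain ragged lists.
edge_origins = [[0, 1, 2, 3, 4], [6, 7, 8]]
edge_destinations = [[1, 2, 3, 4, 5], [7, 8, 9]]
target = [0, 1]

# One record per graph, so edges and label stay together when shuffled.
graphs = []
for origins, destinations, label in zip(edge_origins, edge_destinations, target):
    src, dst, num_nodes = relabel_edges(origins, destinations)
    graphs.append({"edge_index": [src, dst], "num_nodes": num_nodes, "y": label})

train_graphs, test_graphs = shuffle_and_split(graphs, test_fraction=0.5, seed=0)
print(len(train_graphs), len(test_graphs))
```

Each record can then be turned into a Data object as in the loop above, using the relabeled edge_index, and the two lists passed to separate DataLoaders. Because each graph's edges and label travel together in one record, shuffling cannot misalign them.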