Tags: python, machine-learning, neural-network, pytorch-geometric, graph-neural-network

Can we use GNN on graphs with only edge features?


I'm trying to use a GNN to classify phylogeny data (fully bifurcated, singly directed trees). I converted the trees from phylo format in R to a PyTorch dataset. Taking one of the trees as an example:

Data(x=[83, 1], edge_index=[2, 82], edge_attr=[82, 1], y=[1], num_nodes=83)

It has 83 nodes (internal nodes plus tips, x=[83, 1]). I assigned 0 to every node, so every node has the feature value 0. I constructed an 82 × 1 matrix containing the lengths of all the directed edges between nodes (edge_attr=[82, 1]); I intend edge_attr to express the edge lengths and to be used as edge weights. Each tree has one label for classification (y=[1], with values in {0, 1, 2}).

As you can see, node features are not important in my case; the only thing that matters is the edge feature (edge length).

Below is my code implementation for modelling and training:

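import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GCNConv, global_mean_pool

# TreeData is my own dataset class; all_graphs is the list of converted trees.
tree_dataset = TreeData(root=None, data_list=all_graphs)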


class GCN(torch.nn.Module):
    def __init__(self, hidden_size=32):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(tree_dataset.num_node_features, hidden_size)
        self.conv2 = GCNConv(hidden_size, hidden_size)
        self.linear = Linear(hidden_size, tree_dataset.num_classes)

    def forward(self, x, edge_index, edge_attr, batch):
        # 1. Obtain node embeddings
        x = self.conv1(x, edge_index, edge_attr)
        x = x.relu()
        x = self.conv2(x, edge_index, edge_attr)

        # 2. Readout layer
        x = global_mean_pool(x, batch)  # [batch_size, hidden_channels]

        # 3. Apply a final classifier
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.linear(x)

        return x


model = GCN(hidden_size=32)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
train_loader = DataLoader(tree_dataset, batch_size=64, shuffle=True)
print(model)


def train():
    model.train()

    lost_all = 0
    for data in train_loader:
        optimizer.zero_grad()  # Clear gradients.
        out = model(data.x, data.edge_index, data.edge_attr, data.batch)  # Perform a single forward pass.
        loss = criterion(out, data.y)   # Compute the loss.
        loss.backward()  # Derive gradients.
        lost_all += loss.item() * data.num_graphs
        optimizer.step()  # Update parameters based on gradients.

    return lost_all / len(train_loader.dataset)

def test(loader):
    model.eval()

    correct = 0
    for data in loader:  # Iterate in batches over the training/test dataset.
        out = model(data.x, data.edge_index, data.edge_attr, data.batch)
        pred = out.argmax(dim=1)  # Use the class with highest probability.
        correct += int((pred == data.y).sum())  # Check against ground-truth labels.
    return correct / len(loader.dataset)  # Derive ratio of correct predictions.


for epoch in range(1, 20):
    loss = train()
    train_acc = test(train_loader)
    # test_acc = test(test_loader)
    print(f'Epoch: {epoch:03d}, Train Acc: {train_acc:.4f}, Loss: {loss:.4f}')

It seems that my code is not working at all:

......
Epoch: 015, Train Acc: 0.3333, Loss: 1.0988
Epoch: 016, Train Acc: 0.3333, Loss: 1.0979
Epoch: 017, Train Acc: 0.3333, Loss: 1.0938
Epoch: 018, Train Acc: 0.3333, Loss: 1.1044
Epoch: 019, Train Acc: 0.3333, Loss: 1.1012
...... 
Epoch: 199, Train Acc: 0.3333, Loss: 1.0965

Is it because we can't use GNN without meaningful node features? Or is there any problem with my implementation?
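The numbers in the log are themselves a hint. As a rough check, assuming the three classes are about equally frequent (the 0.3333 accuracy suggests this, but the post does not state it), a classifier that outputs the same uniform logits for every tree has a cross-entropy of ln 3 ≈ 1.0986 and an accuracy of 1/3. That matches the plateau, and is consistent with the model ignoring its input entirely. The helper `cross_entropy` below is written only for this illustration:

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample, computed from raw logits via log-sum-exp."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target]

# Assumption: the model emits identical, uniform logits whatever the label is.
constant_logits = [0.0, 0.0, 0.0]
losses = [cross_entropy(constant_logits, y) for y in (0, 1, 2)]
mean_loss = sum(losses) / len(losses)

print(f"{mean_loss:.4f}")  # 1.0986, i.e. ln(3)
```

A loss that hovers around this value for hundreds of epochs usually means the predictions do not depend on the graphs at all, rather than that the optimizer is merely slow.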


Solution

  • Setting all node features to 0 does not work. A GCN layer multiplies the node features by its weight matrix before aggregating over neighbors, so all-zero features produce all-zero messages regardless of the edge weights, and the edge lengths never reach the output. If the nodes have no features, there is a simple solution: use learnable embeddings as the node features.

    You can initialize the embeddings randomly and feed them into the GCN. The model then learns the embeddings together with its other parameters.

    Here's a simple implementation using PyTorch:

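    import torch
    import torch.nn as nn
    from torch_geometric.nn import GCNConv

    class GCNModel(nn.Module):
        def __init__(self, num_nodes, hidden_size=32):
            super().__init__()
            # One learnable vector per node position; num_nodes must be at
            # least the number of nodes in the largest tree.
            self.node_embedding = nn.Embedding(num_nodes, hidden_size)
            # Initialize the embeddings with small random values
            nn.init.normal_(self.node_embedding.weight, std=0.1)
            self.conv1 = GCNConv(hidden_size, hidden_size)  # Your graph convolutional layer here

        def forward(self, edge_index, edge_attr, batch):
            # Index of each node within its own graph, so that batched graphs
            # share one embedding table (assumes batch is sorted, as in PyG mini-batches).
            counts = torch.bincount(batch)
            offsets = counts.cumsum(0) - counts
            node_ids = torch.arange(batch.numel(), device=batch.device) - offsets[batch]
            x = self.node_embedding(node_ids)
            # GCNConv documents edge weights as a 1-D tensor of shape [num_edges]
            x = self.conv1(x, edge_index, edge_attr.reshape(-1))
            ...
            return x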
    

    With this approach, the learnable embeddings stand in for the missing node features, so the edge lengths can actually influence the node representations during message passing. This should let the model train instead of staying at chance level.
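To make the difference concrete, here is a minimal sketch in plain PyTorch that contrasts all-zero features with learnable embeddings on one toy tree. It is not PyG's `GCNConv`: `gcn_norm_dense` and `DenseGCNLayer` are hypothetical helpers that imitate a GCN-style layer (weighted adjacency with self-loops, symmetric normalization, bias added after aggregation), and the 5-node tree and its edge lengths are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def gcn_norm_dense(edge_index, edge_weight, num_nodes):
    """Dense D^-1/2 (A + I) D^-1/2, with edge lengths used as edge weights."""
    adj = torch.zeros(num_nodes, num_nodes)
    adj[edge_index[1], edge_index[0]] = edge_weight  # row = target, column = source
    adj = adj + torch.eye(num_nodes)                 # self-loops with weight 1
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)

class DenseGCNLayer(nn.Module):
    """GCN-style layer: transform features, aggregate, then add a bias."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, adj):
        return adj @ self.lin(x) + self.bias

# Toy bifurcating tree: 0 -> 1, 0 -> 2, 1 -> 3, 1 -> 4
num_nodes, hidden = 5, 8
edge_index = torch.tensor([[0, 0, 1, 1], [1, 2, 3, 4]])
# Same topology, two different sets of edge lengths
adj_a = gcn_norm_dense(edge_index, torch.tensor([0.1, 0.2, 0.3, 0.4]), num_nodes)
adj_b = gcn_norm_dense(edge_index, torch.tensor([1.0, 2.0, 0.5, 1.5]), num_nodes)

# Case 1: all-zero node features. The layer output is just the bias,
# so the two sets of edge lengths are indistinguishable.
layer_zero = DenseGCNLayer(1, hidden)
zeros = torch.zeros(num_nodes, 1)
same_for_zeros = torch.equal(layer_zero(zeros, adj_a), layer_zero(zeros, adj_b))

# Case 2: learnable embeddings as node features. The output now
# depends on the edge lengths.
embedding = nn.Embedding(num_nodes, hidden)
nn.init.normal_(embedding.weight, std=0.1)
layer_emb = DenseGCNLayer(hidden, hidden)
out_a = layer_emb(embedding.weight, adj_a)
out_b = layer_emb(embedding.weight, adj_b)
differs_for_embeddings = not torch.allclose(out_a, out_b)

print(same_for_zeros, differs_for_embeddings)
```

One caveat about the design: a single embedding table shared across trees assumes that node positions are numbered consistently from tree to tree. Whether that holds for phylogenies exported from R is not something the post settles, so treat it as an assumption to verify.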