Tags: python, tensorflow, pytorch, huggingface-transformers, word-embedding

Mapping embeddings to labels in PyTorch/Huggingface


I am working on a project where I use a pre-trained transformer model to generate embeddings for DNA sequences (some have a '1' label and some have a '0' label). I'm trying to map these embeddings back to their corresponding labels in my dataset, but I get an IndexError when I attempt to do so. I think it is related to batching, which I had to introduce because I was running out of memory.

Here is the code I'm working with:

from datasets import Dataset
from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
model = AutoModel.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")

# Load the dataset
ds1 = Dataset.from_file('training.arrow') #this is already tokenized

# Convert tokenized sequences to tensor
inputs = torch.tensor(ds1['input_ids']).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

# Reduce batch size
batch_size = 4

# Pass tokenized sequences through the model with reduced batch size
with torch.no_grad():
    outputs = model(input_ids=inputs[:batch_size], output_hidden_states=True)

# Extract embeddings
hidden_states = outputs.hidden_states
embeddings1 = hidden_states[-1]

Here is the information about the size of the output embeddings and the original dataset:

embeddings1.shape
torch.Size([4, 86, 1280])


ds1
Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 22535512
})

I'm having a hard time figuring out how to map the labels back to the output embeddings, especially given the large discrepancy in sizes. As you can see, I have 22 million sequences, and I would like an embedding for each sequence.

My plan is to use these embeddings for downstream prediction with another model. I have already split my data into train, test, and val. Would it make more sense to compute the embeddings separately for a label-1 dataset and a label-0 dataset, combine them, and then split into train/test, so that I don't have to worry about mapping the labels?


Solution

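  • You can use the dataset's .map function to add the embeddings as a new column. Because .map processes the rows in order and writes each result back to the row it came from, every embedding stays next to its label, so no separate mapping step is needed. I suggest running this on a GPU rather than a CPU, since the number of rows is very high.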

    Please try running the code below.

    import torch
    from datasets import Dataset
    from transformers import AutoTokenizer, AutoModel
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
    model = AutoModel.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref").to(device)
    
    # Load the dataset
    ds = Dataset.from_file('training.arrow') #this is already tokenized
    
    # The full dataset is not moved to the GPU up front;
    # .map() below passes get_embeddings one batch at a time.
    
    # Reduce batch size
    batch_size = 4
    
    def get_embeddings(data):
    
        # Convert the current batch of tokenized sequences to tensors
        input_ids = torch.tensor(data['input_ids']).to(device)
        attention_mask = torch.tensor(data['attention_mask']).to(device)
    
        # Forward pass on this batch only, without tracking gradients
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask, output_hidden_states=True)
        
        # Last hidden layer, shape (batch, sequence length, hidden size)
        hidden_states = outputs.hidden_states
        embeddings = hidden_states[-1]
    
        return {'embeddings' : embeddings.detach().cpu()}
    
    # Extract embeddings
    ds = ds.map(get_embeddings, batched=True, batch_size=batch_size)
    ds
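
The question asks for one embedding per sequence, while the last hidden state holds one vector per token (86 x 1280 values per row in the shapes shown above). One common way to reduce it, which the original answer does not cover, is to average the token vectors while ignoring padded positions. The sketch below is only an illustration: the helper name mean_pool is hypothetical, and random tensors stand in for the real model output so that it runs without downloading the model.

```python
import torch

def mean_pool(last_hidden, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    last_hidden:    (batch, seq_len, hidden) float tensor
    attention_mask: (batch, seq_len) tensor, 1 for real tokens, 0 for padding
    """
    # Add a trailing axis so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)
    # Sum only the vectors of real tokens
    summed = (last_hidden * mask).sum(dim=1)
    # Number of real tokens per sequence; clamp avoids division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# Toy stand-in for the model output (assumed shapes taken from the question)
hidden = torch.randn(4, 86, 1280)
mask = torch.ones(4, 86, dtype=torch.long)
mask[0, 50:] = 0  # pretend the first sequence has 36 padded positions

pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([4, 1280])
```

Inside get_embeddings, this would mean building the mask with torch.tensor(data['attention_mask']).to(device) and returning mean_pool(embeddings, attention_mask).cpu() instead of the full tensor. Assuming float32 storage, that is roughly 22,535,512 x 1280 x 4 bytes, or about 115 GB for the whole dataset, compared with about 9.9 TB if all 86 token vectors per row are kept. Whether mean pooling is the best choice for the downstream model is a judgment call; taking a single token position is another option.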