Word embeddings with BioGPT


I need help to generate word embeddings and store them in a column of a pandas DataFrame. What should I do?

import json
import pandas as pd

from transformers import BioGptTokenizer

with open("data.json") as input_data:
    df = pd.DataFrame.from_records(json.load(input_data))

bio_tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")

df["embedding"] = df["content"].apply(lambda x: <what to do here?>)

What lambda function do I need?

Thanks.


Solution

  • import json
    import pandas as pd
    import torch
    from transformers import BioGptTokenizer, BioGptModel
    
    def get_embedding(sentence, model, tokenizer):
        # Tokenize one string into a batch of size 1 (PyTorch tensors).
        inputs = tokenizer(sentence, return_tensors="pt")
        # Inference only, so skip gradient tracking to save memory.
        with torch.no_grad():
            outputs = model(**inputs)
        # Shape (1, num_tokens, hidden_size): one vector per token.
        return outputs.last_hidden_state
    
    with open("data.json") as input_data:
        df = pd.DataFrame.from_records(json.load(input_data))
    
    bio_tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
    model = BioGptModel.from_pretrained("microsoft/biogpt")
    
    df["embedding"] = df["content"].apply(lambda x: get_embedding(x, model, bio_tokenizer))
    

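    Keep in mind that pandas stores these tensors in a column of dtype object. Each cell still holds a torch.Tensor, but the column itself is not numeric, so convert the values (for example to NumPy arrays) before doing numeric work on them. Also note that last_hidden_state has shape (1, num_tokens, hidden_size), so each cell holds one vector per token rather than a single vector per text.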
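
If one fixed-size vector per row is more convenient than a per-token tensor, a common option (not part of the answer above) is to average the token vectors. Here is a minimal sketch of that idea, followed by turning the object column into a numeric array. The random tensors only stand in for real `last_hidden_state` outputs, the sizes are made up, and `mean_pool` is a hypothetical helper name. A plain mean is fine for one unpadded sentence at a time; batched, padded inputs would need the attention mask taken into account.

```python
import numpy as np
import pandas as pd
import torch


def mean_pool(last_hidden_state: torch.Tensor) -> np.ndarray:
    """Average a (1, num_tokens, hidden_size) tensor over tokens into a 1-D array."""
    # Drop the batch dimension, average over the token axis, convert to NumPy.
    return last_hidden_state.squeeze(0).mean(dim=0).detach().cpu().numpy()


# Stand-ins for model outputs (assumption): 5 and 9 tokens, hidden size 8.
fake_outputs = [torch.randn(1, 5, 8), torch.randn(1, 9, 8)]

df = pd.DataFrame({"content": ["first text", "second text"]})
df["embedding"] = [mean_pool(t) for t in fake_outputs]

# The column still has dtype object; stack its cells into a numeric 2-D array.
matrix = np.stack(df["embedding"].tolist())
print(df["embedding"].dtype, matrix.shape, matrix.dtype)
```

With the real model, `mean_pool(get_embedding(x, model, bio_tokenizer))` inside the `apply` call should slot in the same way, since every pooled vector then has the same length regardless of how many tokens the text has.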