Tags: nlp, chatbot, huggingface-transformers, huggingface-tokenizers, seq2seq

Trying to save history in tokenizer for seq2seq transformer chat model (GODEL base)


I'm fine-tuning a transformer seq2seq model (GODEL base), but I can't seem to save the history in the tokenizer properly. Here's the code:

import pandas as pd
from transformers import AutoTokenizer

context = list(df['Context'])
knowledge = list(df['Knowledge'])
response = list(df['Response'])

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/GODEL-v1_1-base-seq2seq",
    padding_side='left', truncation_side='left')

for i in range(len(context)):
    # Prepare the history from all previous turns
    history = ""
    for j in range(i + 1):
        history += f"{context[j]} {knowledge[j]} {response[j]} "
    # Tokenize the input sequences
    inputs = tokenizer(history, context[i], knowledge[i],
                       padding="longest", max_length=512,
                       truncation=True, return_tensors="pt")
    # Encode the response sequences
    outputs = tokenizer(history, response[i],
                        padding="longest", max_length=512,
                        truncation=True, return_tensors="pt")

The tokenizer output should store the context of the current index together with the context + knowledge + response of all the previous indexes, which make up the history.


Solution

  • Here, I was trying to iterate over a pandas Series while treating it as a list. To resolve this, call the .tolist() method on the pandas Series before iterating over it.
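
A minimal sketch of the fix, using a toy DataFrame with the column names from the question (the data values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Context": ["hi", "how are you?"],
    "Knowledge": ["", ""],
    "Response": ["hello", "fine, thanks"],
})

# .tolist() converts the Series to a plain Python list of strings,
# so indexing and len() behave as expected while building the history.
context = df["Context"].tolist()
knowledge = df["Knowledge"].tolist()
response = df["Response"].tolist()

assert isinstance(context, list)
assert context == ["hi", "how are you?"]
```

With plain lists, the nested loop that concatenates previous turns into the history string iterates over ordinary Python strings rather than Series elements.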