
How to resolve BERT HF Model ValueError: too many values to unpack (expected 2)?


I have a dummy dataset with two text columns and a label column, created as below.

import tensorflow as tf
from transformers import BertTokenizer, TFAutoModelForSequenceClassification
import numpy as np
from datasets import Dataset, DatasetDict

# Create a synthetic dataset with two text columns and a label column (0 or 1)
data_size = 1000
text_column1 = ["This is sentence {}.".format(i) for i in range(data_size)]
text_column2 = ["Another sentence {} for tokenization.".format(i) for i in range(data_size)]
labels = np.random.choice([0, 1], size=data_size)

I am using the HF BERT model (TFAutoModelForSequenceClassification) for classification.

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model2 = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

I prepare the dataset and train the model with the code below.

def tokenize_dataset(df):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(df['text_column1'], df['text_column2'])

# Convert to a DataFrame, then to a Dataset, and tokenize
import pandas as pd
from tensorflow.keras.optimizers import Adam

df = pd.DataFrame({'text_column1': text_column1, 'text_column2': text_column2, 'label': labels})
df = Dataset.from_pandas(df).map(tokenize_dataset)
tf_train = model2.prepare_tf_dataset(df, batch_size=4, shuffle=True, tokenizer=tokenizer)
model2.compile(optimizer=Adam(3e-5))  # No loss argument!
model2.fit(tf_train)

The above code works successfully.
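For reference (a hypothetical sanity check, not part of the original code), inspecting the mapped dataset shows the columns the tokenizer adds; in this variant each example is stored as a plain, unpadded list of token ids, and prepare_tf_dataset pads per batch via the tokenizer passed to it.

# Hypothetical inspection of the mapped dataset (assumes df from the snippet above)
print(df.column_names)
# e.g. ['text_column1', 'text_column2', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
print(df[0]['input_ids'][:10])  # a flat (1-D) list of token ids; length varies per example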

However, when I pass padding, truncation and max_length to the tokenizer, as below,

def tokenize_dataset(df):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(df['text_column1'], df['text_column2'], padding=True, truncation=True, max_length=30, return_tensors="tf")

# Convert to a DataFrame
import pandas as pd
df = pd.DataFrame({'text_column1': text_column1, 'text_column2': text_column2, 'label': labels})
df = Dataset.from_pandas(df).map(tokenize_dataset)
tf_train = model2.prepare_tf_dataset(df, batch_size=4, shuffle=True, tokenizer=tokenizer)
model2.compile(optimizer=Adam(3e-5))  # No loss argument!
model2.fit(tf_train)

This code gave the following error:

ValueError: Exception encountered when calling layer 'bert' (type TFBertMainLayer).

in user code:

    File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_tf_utils.py", line 1557, in run_call_with_unpacked_inputs  *
        return func(self, **unpacked_inputs)
    File "/usr/local/lib/python3.10/dist-packages/transformers/models/bert/modeling_tf_bert.py", line 766, in call  *
        batch_size, seq_length = input_shape

    ValueError: too many values to unpack (expected 2)

I do not understand why this happens. Why does adding padding, truncation and max_length cause this error, and how can I resolve it?


Solution

  • The prepare_tf_dataset function does not expect the dataset columns to hold tensors. With return_tensors="tf", the tokenizer returns each single example as a batched tensor of shape (1, sequence_length), so every stored input_ids (and attention_mask, token_type_ids) value carries an extra leading dimension. When prepare_tf_dataset later batches those examples, the model receives rank-3 inputs, and the line batch_size, seq_length = input_shape fails with "too many values to unpack (expected 2)". Removing return_tensors="tf" so that map stores plain lists solves the problem (a short before/after sketch follows the snippet below).

    def tokenize_dataset(df):
        # Keys of the returned dictionary will be added to the dataset as columns
        return tokenizer(
            df['text_column1'],
            df['text_column2'],
            padding=True,
            truncation=True,
            max_length=30)
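
As a quick way to see the difference (a hypothetical check that reuses the tokenizer, text columns, labels and model2 from the question), comparing one mapped example with and without return_tensors="tf" shows the extra nesting that breaks the model; the rest of the training code can stay exactly as in the question.

# Hypothetical comparison; tokenizer, text_column1, text_column2, labels, model2 come from the question
import pandas as pd
from datasets import Dataset
from tensorflow.keras.optimizers import Adam

raw = pd.DataFrame({'text_column1': text_column1, 'text_column2': text_column2, 'label': labels})

# Variant that fails: return_tensors="tf" makes each example a (1, seq_len) array,
# so the stored input_ids are nested and batches end up rank 3
broken = Dataset.from_pandas(raw).map(
    lambda ex: tokenizer(ex['text_column1'], ex['text_column2'],
                         truncation=True, max_length=30, return_tensors="tf"))
print(broken[0]['input_ids'])  # nested, e.g. [[101, ...]]

# Variant that works: plain lists, so prepare_tf_dataset builds rank-2 batches
fixed = Dataset.from_pandas(raw).map(
    lambda ex: tokenizer(ex['text_column1'], ex['text_column2'],
                         padding=True, truncation=True, max_length=30))
print(fixed[0]['input_ids'])   # flat, e.g. [101, ...]

tf_train = model2.prepare_tf_dataset(fixed, batch_size=4, shuffle=True, tokenizer=tokenizer)
model2.compile(optimizer=Adam(3e-5))  # no loss argument; the model falls back to its internal loss
model2.fit(tf_train)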