
Using Huggingface zero-shot text classification with large data set


I'm trying to use Hugging Face zero-shot text classification with 12 labels on a large data set (57K sentences) read from a CSV file, as follows:

import pandas as pd
import tensorflow as tf
from transformers import pipeline

csv_file = tf.keras.utils.get_file('batch.csv', filename)
df = pd.read_csv(csv_file)
classifier = pipeline('zero-shot-classification')
results = classifier(df['description'].to_list(), labels, multi_class=True)

This keeps crashing because Python runs out of memory. I tried creating a dataset instead:

from datasets import load_dataset

dataset = load_dataset('csv', data_files=filename)

But I'm not sure how to use it with the Hugging Face classifier. What is the best way to batch-process the classification?

Eventually, I would like to feed it over 1M sentences for classification.


Solution

  • The problem isn't that your dataset is too big to fit into RAM, but that you're trying to pass the whole thing through a large transformer model at once. Hugging Face's pipelines don't do any mini-batching under the hood at the moment, so pass the sequences one by one or in small subgroups instead:

    results = [classifier(desc, labels, multi_class=True) for desc in df['description']]
    
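    Each call returns a dict containing the original sequence, the candidate labels sorted by score, and the matching scores, so you can post-process the results afterwards. A minimal sketch, assuming a hypothetical score cutoff of 0.8:

    # each entry in `results` looks like:
    # {'sequence': ..., 'labels': [...], 'scores': [...]}
    threshold = 0.8  # hypothetical cutoff; tune for your data
    predicted = [
        [label for label, score in zip(r['labels'], r['scores']) if score >= threshold]
        for r in results
    ]
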

    If you're using a GPU, you'll get the best speed by passing as many sequences per batch as will fit into the GPU's memory, so you could try the following:

    batch_size = 4  # see how big you can make this number before OOM
    classifier = pipeline('zero-shot-classification', device=0)  # device=0 runs on the GPU
    sequences = df['description'].to_list()
    results = []
    for i in range(0, len(sequences), batch_size):
        results += classifier(sequences[i:i+batch_size], labels, multi_class=True)
    

    and see how large you can make batch_size before you get OOM errors.
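
    To answer the `datasets` part of the question: recent versions of transformers also let you feed a pipeline an iterable (including a datasets dataset wrapped in KeyDataset) together with a batch_size argument, so the pipeline streams batches through the model instead of you slicing manually. A minimal sketch, assuming your CSV has a description column, a recent transformers/datasets release, and the newer multi_label argument name (it replaced multi_class):

    from datasets import load_dataset
    from transformers import pipeline
    from transformers.pipelines.pt_utils import KeyDataset

    dataset = load_dataset('csv', data_files=filename)  # same call as in the question
    labels = ['label_1', 'label_2']  # stand-ins for your 12 candidate labels

    classifier = pipeline('zero-shot-classification', device=0)

    results = []
    # KeyDataset yields only the 'description' column; the pipeline
    # batches internally and streams results back one at a time.
    for output in classifier(KeyDataset(dataset['train'], 'description'),
                             candidate_labels=labels,
                             multi_label=True,
                             batch_size=4):
        results.append(output)

    Because only one batch is resident on the GPU at a time, this approach should keep memory usage flat even at the 1M-sentence scale you're aiming for.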