
Using Huggingface zero-shot text classification with large data set


I'm trying to use Hugging Face zero-shot text classification with 12 labels on a large data set (57K sentences) read from a CSV file, as follows:

import pandas as pd
import tensorflow as tf
from transformers import pipeline

csv_file = tf.keras.utils.get_file('batch.csv', filename)
df = pd.read_csv(csv_file)
classifier = pipeline('zero-shot-classification')
results = classifier(df['description'].to_list(), labels, multi_class=True)

This keeps crashing because Python runs out of memory. I tried creating a dataset instead:

from datasets import load_dataset

dataset = load_dataset('csv', data_files=filename)

But I'm not sure how to use it with the Hugging Face classifier. What is the best way to batch-process the classification?

Eventually, I would like to feed it over 1M sentences for classification.


Solution

  • The problem isn't that your dataset is too big to fit into RAM, but that you're trying to pass the whole thing through a large transformer model at once. Hugging Face's pipelines don't do any mini-batching under the hood at the moment, so pass the sequences one by one or in small subgroups instead:

    results = [classifier(desc, labels, multi_class=True) for desc in df['description']]
    
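    Each call returns a dict containing the original sequence, the candidate labels sorted by score, and the matching scores, so you can post-process the results afterwards. A minimal sketch, assuming a hypothetical score cutoff of 0.8:

    # each entry in `results` looks like:
    # {'sequence': ..., 'labels': [...], 'scores': [...]}
    threshold = 0.8  # hypothetical cutoff; tune for your data
    predicted = [
        [label for label, score in zip(r['labels'], r['scores']) if score >= threshold]
        for r in results
    ]
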

    If you're using a GPU, you'll get the best speed by passing as many sequences per batch as will fit into the GPU's memory, so you could try the following:

    batch_size = 4  # see how big you can make this number before OOM
    classifier = pipeline('zero-shot-classification', device=0)  # device=0 runs on the GPU
    sequences = df['description'].to_list()
    results = []
    for i in range(0, len(sequences), batch_size):
        results += classifier(sequences[i:i+batch_size], labels, multi_class=True)
    

    and see how large you can make batch_size before you get OOM errors.
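
    To answer the `datasets` part of the question: recent versions of transformers also let you feed a pipeline an iterable (including a datasets dataset wrapped in KeyDataset) together with a batch_size argument, so the pipeline streams batches through the model instead of you slicing manually. A minimal sketch, assuming your CSV has a description column, a recent transformers/datasets release, and the newer multi_label argument name (it replaced multi_class):

    from datasets import load_dataset
    from transformers import pipeline
    from transformers.pipelines.pt_utils import KeyDataset

    dataset = load_dataset('csv', data_files=filename)  # same call as in the question
    labels = ['label_1', 'label_2']  # stand-ins for your 12 candidate labels

    classifier = pipeline('zero-shot-classification', device=0)

    results = []
    # KeyDataset yields only the 'description' column; the pipeline
    # batches internally and streams results back one at a time.
    for output in classifier(KeyDataset(dataset['train'], 'description'),
                             candidate_labels=labels,
                             multi_label=True,
                             batch_size=4):
        results.append(output)

    Because only one batch is resident on the GPU at a time, this approach should keep memory usage flat even at the 1M-sentence scale you're aiming for.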