python huggingface-transformers huggingface

IndexError: index out of range in self when using the Hugging Face summarization pipeline

I'm getting an error when attempting to run this code:

import nltk
nltk.download('punkt')
from youtube_transcript_api import YouTubeTranscriptApi

video_id = 'wK4XmXJ299k'

transcript = YouTubeTranscriptApi.get_transcript(video_id)

corpus = ' '.join([line['text'] for line in transcript])

from transformers import pipeline
mysummarization = pipeline("summarization")
mysummary = mysummarization(corpus)
mysummary[0]['summary_text']

The code gets the transcript of a YouTube video and attempts to summarize it with the default Hugging Face Transformers summarization pipeline. The error is IndexError: index out of range in self.

I am also seeing these warnings:

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6). Using a pipeline without specifying a model name and revision in production is not recommended.

Token indices sequence length is longer than the specified maximum sequence length for this model (11628 > 1024). Running this sequence through the model will result in indexing errors

How do I fix this?

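Solution

The sequence-length warning points to the cause: the transcript tokenizes to 11628 tokens, but the model accepts at most 1024. Tokens past that limit have no position embedding to look up, which is most likely what raises IndexError: index out of range in self. The code below works around this by summarizing only the first 100 transcript lines, which keeps the input under the limit for this video. Note that this discards the rest of the transcript.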

import nltk
nltk.download('punkt')
from youtube_transcript_api import YouTubeTranscriptApi

video_id = 'wK4XmXJ299k'

transcript = YouTubeTranscriptApi.get_transcript(video_id)

corpus = ' '.join([line['text'] for line in transcript[:100]])  # keep only the first 100 lines: the full transcript exceeds the model's 1024-token limit
print(corpus)
from transformers import pipeline
mysummarization = pipeline("summarization", min_length=30, max_length=90)
mysummary = mysummarization(corpus)
print(mysummary[0]['summary_text'])

Output - Aaron Rodgers is fresh out of a victory over the Super Bowl champion Los Angeles Rams and Lambeau last evening on Monday Night Football . The back-to-back NFL MVP says he's looking forward to a holiday party .
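
If the whole transcript needs to be summarized rather than just its opening, one option is to split the text into pieces that each fit the model and summarize them one at a time. The following is only a minimal sketch of that idea, not part of the original answer: chunk_words and summarize_long are hypothetical helper names, and the 400-word chunk size is an assumed rough stand-in for the 1024-token limit (word counts and token counts are not the same, so the value may need tuning).

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    max_words=400 is an assumed heuristic, not an exact token budget.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = 400,
) -> str:
    """Summarize each chunk with the given callable and join the results.

    `summarize` is any function that maps a chunk of text to its summary.
    """
    partial_summaries = [summarize(chunk) for chunk in chunk_words(text, max_words)]
    return " ".join(partial_summaries)
```

With the pipeline from the answer, the callable could be something like lambda chunk: mysummarization(chunk, truncation=True)[0]['summary_text'], passed as the second argument to summarize_long(corpus, ...). Passing truncation=True is a safety net in case a chunk still tokenizes to more than 1024 tokens. The result is a concatenation of per-chunk summaries, so it will be longer and less cohesive than a single summary.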