Tags: python, huggingface-transformers, huggingface

IndexError: index out of range in self when using summarization, hugging face


I'm getting an error when attempting to run this code:

import nltk
nltk.download('punkt')
from youtube_transcript_api import YouTubeTranscriptApi

video_id = 'wK4XmXJ299k'

transcript = YouTubeTranscriptApi.get_transcript(video_id)

corpus = ' '.join([line['text'] for line in transcript])

from transformers import pipeline
mysummarization = pipeline("summarization")
mysummary = mysummarization(corpus)
mysummary[0]['summary_text']

The code fetches the transcript of a YouTube video and tries to summarize it with the default Hugging Face Transformers summarization pipeline. It fails with IndexError: index out of range in self.

I am also seeing these messages:

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6). Using a pipeline without specifying a model name and revision in production is not recommended.

Token indices sequence length is longer than the specified maximum sequence length for this model (11628 > 1024). Running this sequence through the model will result in indexing errors

How do I fix this?


Solution

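The warning already names the cause: the default model (sshleifer/distilbart-cnn-12-6) accepts at most 1024 tokens, but the full transcript tokenizes to 11628. The input has to be shortened before it reaches the model. The simplest fix is to summarize only the first part of the transcript:

    import nltk
    nltk.download('punkt')
    from youtube_transcript_api import YouTubeTranscriptApi
    
    video_id = 'wK4XmXJ299k'
    
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    
    # Keep only the first 100 transcript lines: the full text is far too long for the model
    corpus = ' '.join([line['text'] for line in transcript[:100]])
    print(corpus)
    from transformers import pipeline
    # min_length and max_length bound the length of the generated summary (in tokens)
    mysummarization = pipeline("summarization", min_length=30, max_length=90)
    # truncation=True cuts the input at the model limit if the slice is still too long
    mysummary = mysummarization(corpus, truncation=True)
    print(mysummary[0]['summary_text'])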
    

    Output - Aaron Rodgers is fresh out of a victory over the Super Bowl champion Los Angeles Rams and Lambeau last evening on Monday Night Football . The back-to-back NFL MVP says he's looking forward to a holiday party .
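
Slicing to the first 100 lines only summarizes the start of the video. If you want the whole transcript covered, one option is to split the text into chunks that fit under the 1024-token limit, summarize each chunk, and join the results. Below is a minimal sketch of that idea, not the only way to do it. The helper names `chunk_words`, `summarize_long_text` and `build_default_summarizer` are made up for this example, and the 400-word chunk size is an assumption (a rough guess that should stay under 1024 tokens for English text), so tune it for your data.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(text: str, summarizer: Callable, max_words: int = 400) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    partial_summaries = []
    for chunk in chunk_words(text, max_words):
        # truncation=True is a safety net in case a chunk still exceeds the model limit
        result = summarizer(chunk, min_length=30, max_length=90, truncation=True)
        partial_summaries.append(result[0]["summary_text"].strip())
    return " ".join(partial_summaries)


def build_default_summarizer():
    """Hypothetical helper: load the same model the question defaulted to."""
    # Imported lazily so the chunking helpers can be used without transformers installed
    from transformers import pipeline
    return pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
```

With the `corpus` string from the question, usage would look like `print(summarize_long_text(corpus, build_default_summarizer()))`. Splitting on words is crude because chunks can cut sentences in half; splitting on sentence boundaries would probably give cleaner summaries, at the cost of a bit more code.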