Tags: python, web-scraping, nlp, youtube, video-streaming

How to extract subtitles from YouTube videos in different languages


I have used the code below to extract subtitles from YouTube videos, but it only works for videos in English. I also have some videos in Spanish. How can I modify the code to extract Spanish subtitles as well?

from pytube import YouTube
from youtube_transcript_api import YouTubeTranscriptApi

# Define the video URL or ID of the YouTube video you want to extract text from
video_url = 'https://www.youtube.com/watch?v=xYgoNiSo-kY'

# Download the video using pytube
youtube = YouTube(video_url)
video = youtube.streams.get_highest_resolution()
video.download()

# Get the downloaded video file path
video_path = video.default_filename

# Get the video ID from the URL
video_id = video_url.split('v=')[-1]

# Get the transcript for the specified video ID
transcript = YouTubeTranscriptApi.get_transcript(video_id)

# Extract the text from the transcript
captions_text = ''
for segment in transcript:
    caption = segment['text']
    captions_text += caption + ' '

# Print the extracted text
print(captions_text)

Solution

  • Use list_transcripts to get the list of languages available for the video:

    Example:

    video_id = 'xYgoNiSo-kY'
    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
    

    Then loop over transcript_list to see which languages are available:

    Example:

    for tr in transcript_list:
        print(tr.language_code)
    

    In this case, the result is:

    es
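
If only one language is needed, listing every transcript may not be necessary. The get_transcript call from the question takes a languages argument, a list of language codes in priority order, and it defaults to English only, which would explain why the original code fails on Spanish videos. The following is a minimal sketch under that assumption; the fetch_text name and its api parameter are hypothetical (the parameter exists only so the function can be exercised without network access), and it assumes a release of youtube-transcript-api that still provides the static get_transcript method used in the question. Newer releases may expose a different interface.

```python
def fetch_text(video_id, preferred=("es", "en"), api=None):
    """Fetch captions in the first available preferred language as one string.

    `api` is a hypothetical injection point for testing; when omitted, the
    real YouTubeTranscriptApi class is imported lazily.
    """
    if api is None:
        # Imported here so the module loads even without the package installed
        from youtube_transcript_api import YouTubeTranscriptApi as api

    # `languages` is tried in order: Spanish first, English as a fallback
    segments = api.get_transcript(video_id, languages=list(preferred))

    # Each segment is assumed to be a dict with a 'text' key, as in the question
    return " ".join(segment["text"] for segment in segments)
```

Calling fetch_text('xYgoNiSo-kY') would then request the Spanish captions first and fall back to English if no Spanish transcript exists.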

    Modify your code to loop over the languages available for the video and download the captions for each one:

    Example:

    # List that stores the downloaded captions, one entry per language:
    all_captions = []

    # Loop over all languages available for this video and download the captions:
    for tr in transcript_list:
        print("Downloading captions in " + tr.language + "...")
        # Each item is already a transcript object, so it can be fetched directly
        segments = tr.fetch()
        # Join the text of every segment into a single string
        captions_text = ' '.join(segment['text'] for segment in segments)
        all_captions.append({"language": tr.language_code + " - " + tr.language, "captions": captions_text})
        print("=" * 20)
    print("Done")
    

    The all_captions variable then holds the captions, together with their language, for every transcript available for the given video_id.
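
The loop above can also be wrapped in a small reusable function that returns a dictionary keyed by language code, which makes lookups such as captions['es'] straightforward. This is a sketch rather than a definitive implementation: the names segment_text and captions_by_language are hypothetical, and the attribute-style branch is an assumption that some releases of youtube-transcript-api return segment objects with a text attribute instead of dicts. Check the version you have installed.

```python
def segment_text(segment):
    """Return the caption text of one segment.

    Handles dict-style segments (as in the question) and, as an assumption,
    attribute-style segments that expose a `text` attribute.
    """
    if isinstance(segment, dict):
        return segment["text"]
    return segment.text


def captions_by_language(transcript_list):
    """Map each language code to the joined caption text of its transcript."""
    result = {}
    for tr in transcript_list:
        # tr.fetch() downloads the segments for this particular transcript
        result[tr.language_code] = " ".join(segment_text(s) for s in tr.fetch())
    return result


# Usage (requires network access and the youtube-transcript-api package):
#   from youtube_transcript_api import YouTubeTranscriptApi
#   captions = captions_by_language(YouTubeTranscriptApi.list_transcripts("xYgoNiSo-kY"))
#   print(captions.get("es"))
```

If a video offers both a manually created and an auto-generated transcript for the same language code, the later one in the listing overwrites the earlier one in this sketch; keying on a (language_code, is_generated) pair would keep both.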