python, python-3.x, regex, python-re

How can I use regex to help scrape web data?


I am trying to get the URLs on a single YouTube video page. youtube-dl can do this, but I just need the URLs, so I want to learn how to do it myself.

This is my code to get the page source: source = requests.get("https://www.youtube.com/watch?v=zXif_9RVadI")

I am looking at the 21st line of this source: source_line_21 = source.text.split("\n")[20]

I want all URLs that start with https://r[0-9], contain googlevideo.com/videoplayback, and end with ",".

I have tried many patterns, but I always get 0 or 1 match, even though there are 15-20 matches in the line.

re.match(r'https:\/\/.*googlevideo.com/videoplayback.*mimeType', source_line_21)

I am not good at regex and cannot get this right. Thank you.

I am searching in the output of print(source_line_21[:32600]). It is too long to include here, so I have pasted it externally.
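
Note on the 0-or-1-match problem: re.match only tests the beginning of the string, so it can never return more than one result. A minimal re.findall sketch, assuming source_line_21 holds the raw line described above, could look like this:

    import re

    # Sketch only: scan the whole line instead of just its start.
    # Matches URLs that begin with https://r<digit> and contain
    # googlevideo.com/videoplayback, stopping at the closing quote.
    pattern = re.compile(r'https://r\d[^"]*googlevideo\.com/videoplayback[^"]*')
    matches = pattern.findall(source_line_21)
    print(len(matches))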


Solution

  • The operation you're looking to perform is slightly more involved, but it can be simplified through the use of a couple of tools, listed below.

    I've used urllib in the example because my requests call brought back Google's "Before you continue to YouTube" cookie-consent page, but urllib allowed me to bypass that rubbish.

    Tools:

    • urllib (or) requests
    • BeautifulSoup - via the bs4 library
    • Regex - via the re library
    • JSON - via the json library

    Logic:

    1. Scrape the site data
    2. Parse HTML using BeautifulSoup
    3. Extract the tag(s) of interest
    4. Iterate through the tags and look for the JavaScript variable of interest using regex
    5. Iterate through the variable's contents (using JSON) to get the URLs

    Code:

    # Imports needed for the snippet below.
    import json
    import re
    import urllib.request
    from bs4 import BeautifulSoup as bs

    # Use urllib to read the site content.
    source = urllib.request.urlopen("https://www.youtube.com/watch?v=zXif_9RVadI").read().decode()
    # Parse the HTML using BeautifulSoup.
    soup = bs(source, features='html.parser')
    # Extract all <script> tags.
    scripts = soup.findAll('script')
    # Build a regex pattern to capture the relevant <script> tag's content.
    exp = re.compile(r'^var\sytInitialPlayerResponse\s=\s(?P<content>.*\})')
    
    # Iterate through all scripts to find the one with video content.
    for s in scripts:
        if s.string:
            m = re.match(exp, s.string)
            if m:
                data = m.groupdict().get('content')
    
    # Parse the matched <script> content (a JSON string) into a Python dict.
    content = json.loads(data)
    
    # Collect all URIs into a list.
    urls = []
    for fmt in ['formats', 'adaptiveFormats']:
        for ele in content['streamingData'][fmt]:
            urls.append(ele['url'])
    

    Confirm URIs:

    # Print the detected URIs:
    for i, url in enumerate(urls, 1):
        print(i, url[:75])
    
    1 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    2 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    3 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    4 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    5 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    6 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    7 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    8 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    9 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    10 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    11 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    12 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    13 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    14 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    15 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    16 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
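
    If you then want to narrow that list down to just the r[0-9]...googlevideo.com/videoplayback links described in the question, a small filter over the urls list built above would do it (a sketch, not part of the original solution):

    import re

    # Sketch: keep only the URLs matching the question's criteria.
    videoplayback_urls = [u for u in urls
                          if re.match(r'https://r\d', u)
                          and 'googlevideo.com/videoplayback' in u]
    print(len(videoplayback_urls))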