python, python-3.x, regex, python-re

How can I use regex to help scrape web data?


I am trying to get the URLs on a single YouTube video page. youtube-dl can do this, but I just need the URLs, so I want to learn how to do it myself.

This is my code to get the page source: source = requests.get("https://www.youtube.com/watch?v=zXif_9RVadI")

I am looking at the 21st line of this source: source_line_21 = source.text.split("\n")[20]

I want all URLs that start with https://r[0-9], contain googlevideo.com/videoplayback, and end with ",".

I have tried many patterns, but I always get 0 or 1 match, even though there are 15-20 matches in the line.

re.match(r'https:\/\/.*googlevideo.com/videoplayback.*mimeType', source_line_21)

I am not good at regex and cannot get this right. Thank you.

I am searching in the output of print(source_line_21[:32600]). It is too long to include here, so I have pasted it externally.
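
Note on the 0-or-1-match problem: re.match only tests the beginning of the string, so it can never return more than one result. A minimal re.findall sketch, assuming source_line_21 holds the raw line described above, could look like this:

    import re

    # Sketch only: scan the whole line instead of just its start.
    # Matches URLs that begin with https://r<digit> and contain
    # googlevideo.com/videoplayback, stopping at the closing quote.
    pattern = re.compile(r'https://r\d[^"]*googlevideo\.com/videoplayback[^"]*')
    matches = pattern.findall(source_line_21)
    print(len(matches))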


Solution

  • The operation you're looking to perform is slightly more involved, but it can be simplified through the use of a couple of tools, listed below.

    I've used urllib in the example because my requests call brought back Google's "Before you continue to YouTube" cookie-consent page, but urllib allowed me to bypass that rubbish.

    Tools:

    • urllib (or) requests
    • BeautifulSoup - via the bs4 library
    • Regex - via the re library
    • JSON - via the json library

    Logic:

    1. Scrape the site data
    2. Parse HTML using BeautifulSoup
    3. Extract the tag(s) of interest
    4. Iterate through the tags and look for the JavaScript variable of interest using regex
    5. Iterate through the variable's contents (using JSON) to get the URLs

    Code:

    # Imports needed for the snippet below.
    import json
    import re
    import urllib.request
    from bs4 import BeautifulSoup as bs

    # Use urllib to read the site content.
    source = urllib.request.urlopen("https://www.youtube.com/watch?v=zXif_9RVadI").read().decode()
    # Parse the HTML using BeautifulSoup.
    soup = bs(source, features='html.parser')
    # Extract all <script> tags.
    scripts = soup.findAll('script')
    # Build a regex pattern to capture the relevant <script> tag's content.
    exp = re.compile(r'^var\sytInitialPlayerResponse\s=\s(?P<content>.*\})')
    
    # Iterate through all scripts to find the one with video content.
    for s in scripts:
        if s.string:
            m = re.match(exp, s.string)
            if m:
                data = m.groupdict().get('content')
    
    # Parse the matched <script> content (a JSON string) into a Python dict.
    content = json.loads(data)
    
    # Collect all URIs into a list.
    urls = []
    for fmt in ['formats', 'adaptiveFormats']:
        for ele in content['streamingData'][fmt]:
            urls.append(ele['url'])
    

    Confirm URIs:

    # Print the detected URIs:
    for i, url in enumerate(urls, 1):
        print(i, url[:75])
    
    1 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    2 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    3 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    4 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    5 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    6 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    7 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    8 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    9 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    10 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    11 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    12 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    13 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    14 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    15 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
    16 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
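
    If you then want to narrow that list down to just the r[0-9]...googlevideo.com/videoplayback links described in the question, a small filter over the urls list built above would do it (a sketch, not part of the original solution):

    import re

    # Sketch: keep only the URLs matching the question's criteria.
    videoplayback_urls = [u for u in urls
                          if re.match(r'https://r\d', u)
                          and 'googlevideo.com/videoplayback' in u]
    print(len(videoplayback_urls))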