I am trying to get the URLs from a single YouTube video page. youtube-dl can do this, but I only need the URLs, so I want to learn how to do it myself.
This is my code to get the page source: source = requests.get("https://www.youtube.com/watch?v=zXif_9RVadI")
I am looking at line 21 of the page source: source_line_21 = source.text.split("\n")[20]
I want all URLs that start with https://r[0-9],
contain googlevideo.com/videoplayback,
and end with ","
I have tried many patterns, but I always get 0 or 1 match, although there should be 15-20 matches.
re.match(r'https:\/\/.*googlevideo.com/videoplayback.*mimeType', source_line_21)
I am not good at regex and cannot get this right. Thank you all.
This is the text I am searching. It is too long to include here, so I pasted it elsewhere: output of print(source_line_21[:32600])
The operation you're looking to perform is slightly more involved, but it can be simplified through the use of a couple of tools, listed below.
I've used urllib in the example because my requests request brought back Google's "Before you continue to YouTube" cookie-confirmation page, whereas urllib allowed me to bypass that rubbish.
- urllib (or requests)
- bs4 library
- re library
- json library

import json
import re
import urllib.request

from bs4 import BeautifulSoup as bs

# Using urllib to read site content.
source = urllib.request.urlopen("https://www.youtube.com/watch?v=zXif_9RVadI").read().decode()
# Parse HTML using BeautifulSoup
soup = bs(source, features='html.parser')
# Extract all <script> tags.
scripts = soup.find_all('script')
# Build regex pattern to extract the <script> tag's content.
exp = re.compile(r'^var\sytInitialPlayerResponse\s=\s(?P<content>.*\})')
# Iterate through all scripts to find the one with video content.
data = None
for s in scripts:
    if s.string:
        m = exp.match(s.string)
        if m:
            data = m.group('content')
            break
# Stop early with a clear message if the expected script was not found.
if data is None:
    raise RuntimeError('ytInitialPlayerResponse was not found in the page source.')
# Parse the content of the <script> of interest as JSON.
content = json.loads(data)
# Collect all URIs into a list.
urls = []
for fmt in ['formats', 'adaptiveFormats']:
    for ele in content['streamingData'][fmt]:
        urls.append(ele['url'])
# Print the detected URIs:
for i, url in enumerate(urls, 1):
    print(i, url[:75])
1 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
2 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
3 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
4 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
5 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
6 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
7 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
8 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
9 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
10 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
11 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
12 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
13 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
14 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
15 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
16 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
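As for why the pattern in the question finds 0 or 1 match: re.match only matches at the very start of the string, and a greedy .* runs from the first https:// to the last mimeType, swallowing every URL in between. Below is a minimal sketch of a regex-only alternative using re.findall with [^"]* so that each match stops at the closing quote. The sample string is a made-up, shortened stand-in for the real page-source line, not actual YouTube output.

```python
import re

# Hypothetical, shortened stand-in for the long page-source line.
sample = (
    '{"formats":['
    '{"itag":18,"url":"https://r2---sn-abc.googlevideo.com/videoplayback?expire=1&itag=18",'
    '"mimeType":"video/mp4"},'
    '{"itag":22,"url":"https://r5---sn-abc.googlevideo.com/videoplayback?expire=1&itag=22",'
    '"mimeType":"video/mp4"}]}'
)

# The pattern from the question.
greedy = r'https:\/\/.*googlevideo.com/videoplayback.*mimeType'

# re.match anchors at position 0; the string starts with '{', so nothing matches.
match_result = re.match(greedy, sample)

# re.findall scans the whole string, but the greedy .* merges everything
# from the first URL to the last "mimeType" into a single match.
greedy_matches = re.findall(greedy, sample)

# [^"]* cannot cross a double quote, so each URL is matched separately.
pattern = re.compile(r'https://r[0-9][^"]*googlevideo\.com/videoplayback[^"]*')
urls = pattern.findall(sample)

for i, url in enumerate(urls, 1):
    print(i, url)
```

Note that in the raw page source, characters such as & may appear escaped (for example as \u0026), so URLs pulled out with a regex may still need unescaping. json.loads handles that for you, which is one reason to prefer the JSON approach above.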