I have the following links to be extracted:
[{"file":"https:\/\/www.rapidvideo.com\/loadthumb.php?v=FFIMB47EWD","kind":"thumbnails"}],
"sources": [
{"file":"https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
"label":"Standard (288p)","res":"288"},
{"file":"https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4"
I would like to extract only the links ending in .mp4.
My regex is as follows:
"file":"(https\:.*?\.mp4)"
However, the matches are wrong: the first link, which ends in .php, is matched as well. I am practising on Pythex.org. How do I avoid the first link? The HTML page I am trying to parse is https://www.rapidvideo.com/e/FFIMB47EWD
Why even use regular expressions? This looks like a JSON object/Python dict, so you could just iterate through it and use str.endswith.
>>> sources = {
... "sources": [
... {"file": "https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
... "label": "Standard (288p)","res":"288"},
... {"file": "https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4",
... "label": "Standard (288p)","res":"288"}
... ]
... }
>>> for item in sources['sources']:
...     if item['file'].endswith('.mp4'):
...         print(item['file'])
...
https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4
https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4
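If a regular expression is still wanted, the pattern from the question fails because the lazy `.*?` can cross quote boundaries: the match starts at the first `"file":"https:` (the .php link) and keeps expanding until the first `.mp4`. A minimal sketch of one possible fix is to replace `.` with `[^"]` so a match cannot leave the current string. The sample text below is modelled on the snippet in the question, with a closing quote and brace added to the last entry as an assumption, and `find_mp4_links` is a hypothetical helper name.

```python
import json
import re

# Sample text modelled on the question's snippet; the closing quote and brace
# on the last entry are an assumption, since the original snippet is cut off.
page_text = r'''[{"file":"https:\/\/www.rapidvideo.com\/loadthumb.php?v=FFIMB47EWD","kind":"thumbnails"}],
"sources": [
{"file":"https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
"label":"Standard (288p)","res":"288"},
{"file":"https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4"}'''

# [^"]* cannot run past the closing quote, so the .php entry no longer matches.
MP4_PATTERN = re.compile(r'"file":"(https:[^"]*\.mp4)"')


def find_mp4_links(text):
    # Re-quoting each match and passing it to json.loads turns the JSON
    # escape \/ back into a plain /.
    return [json.loads('"' + raw + '"') for raw in MP4_PATTERN.findall(text)]


mp4_links = find_mp4_links(page_text)
for link in mp4_links:
    print(link)
```

This only helps when the links are present in the text being searched.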
EDIT:
It looks like that link is available in a video tag after the JavaScript has loaded. You could use a headless browser, but I just used selenium to fully load the page and then save the HTML.
After you have the full page HTML, you can parse it using BeautifulSoup instead of regular expressions.
See also: Using regular expressions to parse HTML: why not?
from bs4 import BeautifulSoup
from selenium import webdriver


def extract_mp4_link(page_html):
    # The 'lxml' parser requires the lxml package to be installed.
    soup = BeautifulSoup(page_html, 'lxml')
    return soup.find('video')['src']


def get_page_html(url):
    # Load the page in a real browser so the JavaScript can run.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # quit() ends the whole browser session, not just the current window.
        driver.quit()


if __name__ == '__main__':
    page_url = 'https://www.rapidvideo.com/e/FFIMB47EWD'
    page_html = get_page_html(page_url)
    print(extract_mp4_link(page_html))
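Note that soup.find('video')['src'] assumes the video tag exists and carries the src attribute itself; some pages put it on a nested source tag instead. As a rough sketch of a more forgiving lookup, here is a version that uses only the standard library's html.parser rather than BeautifulSoup (a swapped-in technique, not what the code above does). `VideoSrcParser`, `find_video_sources` and the sample markup are hypothetical and not taken from the actual page.

```python
from html.parser import HTMLParser


class VideoSrcParser(HTMLParser):
    """Collects src values from <video> tags and from <source> tags inside them."""

    def __init__(self):
        super().__init__()
        self.inside_video = False
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'video':
            self.inside_video = True
            if attributes.get('src'):
                self.sources.append(attributes['src'])
        elif tag == 'source' and self.inside_video and attributes.get('src'):
            self.sources.append(attributes['src'])

    def handle_endtag(self, tag):
        if tag == 'video':
            self.inside_video = False


def find_video_sources(page_html):
    # Returns an empty list instead of raising when no video tag is present.
    parser = VideoSrcParser()
    parser.feed(page_html)
    return parser.sources


# Hypothetical markup for illustration only.
sample_html = ('<html><body><video controls>'
               '<source src="https://example.com/clip.mp4" type="video/mp4">'
               '</video></body></html>')
print(find_video_sources(sample_html))
```

The result of get_page_html could be passed to find_video_sources in the same way as the sample string.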