python, regex, non-greedy

Non-greedy search for beginning of string


I have the following links to be extracted:

[{"file":"https:\/\/www.rapidvideo.com\/loadthumb.php?v=FFIMB47EWD","kind":"thumbnails"}], 
    "sources": [
        {"file":"https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
         "label":"Standard (288p)","res":"288"},
        {"file":"https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4"

I would like to extract only the links ending in .mp4.

My regex is as follows:

"file":"(https\:.*?\.mp4)"

However, the matches are wrong: the first match starts at the first link, which ends in .php. I am practising on Pythex.org. How do I avoid matching the first link? The HTML page I am trying to parse is https://www.rapidvideo.com/e/FFIMB47EWD


Solution

  • Why even use regular expressions? This looks like a JSON object (or a Python dict), so you could just iterate through it and use str.endswith.

    >>> sources = {
    ...     "sources": [
    ...         {"file": "https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
    ...          "label": "Standard (288p)","res":"288"},
    ...         {"file": "https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4",
    ...          "label": "Standard (288p)","res":"288"}
    ...     ]
    ... }
    >>> for item in sources['sources']:
    ...     if item['file'].endswith('.mp4'):
    ...         print(item['file'])
    ... 
    https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4
    https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4
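
If the data arrives as JSON text rather than as a Python dict, the same idea can be sketched with the standard json module, which also turns the escaped `\/` sequences back into plain slashes. The `raw` string below is a hypothetical, well-formed stand-in built from the fragment in the question; the real page may wrap or truncate this data differently.

```python
import json

# Hypothetical, well-formed sample based on the fragment in the question.
# The raw string keeps the "\/" escapes exactly as they appear in the page.
raw = r'''
{"sources": [
    {"file": "https:\/\/www.rapidvideo.com\/loadthumb.php?v=FFIMB47EWD",
     "kind": "thumbnails"},
    {"file": "https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
     "label": "Standard (288p)", "res": "288"}
]}
'''

# json.loads decodes the JSON escape "\/" to "/".
data = json.loads(raw)

# Keep only the entries whose URL ends in .mp4.
mp4_links = [item["file"] for item in data["sources"]
             if item["file"].endswith(".mp4")]
print(mp4_links)
```

This only works once a complete JSON value has been isolated from the page, which is an assumption here.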
    

    EDIT:

    It looks like that link is available in a video tag once the JavaScript has run. You could use a headless browser, but I just used Selenium to load the page fully and then save the HTML.

    After you have the full page html, you can parse it using BeautifulSoup instead of regular expressions.

    See also: "Using regular expressions to parse HTML: why not?"

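    from bs4 import BeautifulSoup
    from selenium import webdriver


    def extract_mp4_link(page_html):
        """Return the src of the first <video> tag, or None if there is none."""
        soup = BeautifulSoup(page_html, 'lxml')  # the 'lxml' parser needs the lxml package
        video = soup.find('video')
        return video.get('src') if video is not None else None


    def get_page_html(url):
        """Load the page in Chrome so its JavaScript runs, then return the HTML."""
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            return driver.page_source
        finally:
            # quit() ends the whole browser session, even if get() raised
            driver.quit()


    if __name__ == '__main__':
        page_url = 'https://www.rapidvideo.com/e/FFIMB47EWD'
        page_html = get_page_html(page_url)
        print(extract_mp4_link(page_html))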
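
As for the regex in the question itself: a lazy `.*?` only keeps the match short from the point where it starts, and the engine still starts at the leftmost `"file":"https:`, which is the .php link. One possible fix is to replace `.` with `[^"]` so a match can never run past the closing quote of its own URL. The sketch below assumes the data sits on a single line in the page source (by default `.` does not cross newlines, so the wrong match implies this or a DOTALL flag); `text` is a hypothetical one-line copy of the fragment from the question.

```python
import re

# Hypothetical one-line copy of the fragment from the question.
text = (
    r'[{"file":"https:\/\/www.rapidvideo.com\/loadthumb.php?v=FFIMB47EWD","kind":"thumbnails"}], '
    r'"sources": [{"file":"https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",'
    r'"label":"Standard (288p)","res":"288"},'
    r'{"file":"https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4"'
)

# Original pattern: the first match begins at the .php link and
# stretches across the quotes to the first ".mp4".
lazy_matches = re.findall(r'"file":"(https\:.*?\.mp4)"', text)

# Negated character class: the capture cannot contain a double quote,
# so each match stays inside one URL.
pattern = re.compile(r'"file":"(https:[^"]*\.mp4)"')

# Undo the JSON "\/" escaping on each captured URL.
links = [url.replace(r'\/', '/') for url in pattern.findall(text)]

for link in links:
    print(link)
```

With this sample, `lazy_matches[0]` still begins with the rapidvideo.com thumbnail URL, while `links` holds only the two .mp4 URLs.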