javascript, web-scraping, blob, http-live-streaming, m3u8

Scraping the path of an m3u8 file


For self-study, I'm trying to scrape a unique value from the m3u8 URL path of an embedded video. Every embedded video on the site shares the same URL path except for this unique value.

For example, from the https://headlines.yahoo.co.jp/videonews/ann?a=20190526-00000026-ann-int page, I can find the m3u8 path through the inspector's network tab:

https://gw-yvpub.c.yimg.jp/v1/hls/CFukHuaO2W13gxbJ/video.m3u8

The unique value here is CFukHuaO2W13gxbJ. However, I cannot for the life of me find this value anywhere in the page source or in any of the other inspector tabs. Is it possible to find this URL in the page source, or to work out where it is generated?
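
To make the target concrete, here is a minimal sketch of pulling that value out of a playlist URL once one has been captured. The function name `extractHlsId` is hypothetical, and the pattern assumes every playlist URL has the same `/v1/hls/<value>/video.m3u8` shape as the one above.

```javascript
// Hypothetical helper: returns the per-video path segment of a playlist URL
// shaped like https://gw-yvpub.c.yimg.jp/v1/hls/<value>/video.m3u8,
// or null when the URL does not have that shape (assumed, not confirmed).
function extractHlsId(m3u8Url) {
  const { pathname } = new URL(m3u8Url);
  const match = pathname.match(/\/hls\/([^/]+)\/video\.m3u8$/);
  return match ? match[1] : null;
}

console.log(
  extractHlsId("https://gw-yvpub.c.yimg.jp/v1/hls/CFukHuaO2W13gxbJ/video.m3u8")
);
// prints: CFukHuaO2W13gxbJ
```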

Side note: A request is made to this blob URL right before the requests for the m3u8 file:

blob:https://s.yimg.jp/f23ed5ca-7a95-4409-bf66-c26c577157d2

Thanks in advance for any guidance!


Solution

  • The m3u8 URLs are present in the response to a request made to this URL:

    https://feapi-yvpub.yahooapis.jp/v1/content/1576087?appid=dj0zaiZpPVZMTVFJR0FwZWpiMyZzPWNvbnN1bWVyc2VjcmV0Jng9YjU-&output=json&space_id=2078710316&domain=headlines.yahoo.co.jp&ak=044ddff76151606c2d97ada9daa3ea45&device_type=1100&thumb_width=1204&thumb_height=676&thumb_priority=l&thumb_bd=0
    

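    Some of the values in that request can be found in the source of the page you linked. The space ID (2078710316), for example, appears in the script below. Note that the appID shown there is not the same string as the appid in the request above, so that value must come from somewhere else: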

    <script type="text/javascript">
    YAHOO.JP.srch.dlink.onLoad(function(sl) {
        sl.setParams({"serviceCode":"nws","appID":"dj0zaiZpPWlzQ3RiOHo1cGxBNSZzPWNvbnN1bWVyc2VjcmV0Jng9ODQ-","articleID":"20190526-00000026-ann","category":null,"mediaID":"ann","spaceID":2078710316,"linkCount":"5","launchAfterDocLoad":false});
    });
    </script>

    The content ID (1576087) also appears in the page, for example in the player embed script:

    <script type="text/javascript" class="yvpub-player" src="https://s.yimg.jp/images/yvpub/player/js/embed.js?contentid=1576087&amp;width=602&amp;height=338&amp;propertyname=jp_news&amp;spaceid=2078710316&amp;repeat=0&amp;recommend=0&amp;autostart=1" data-composed="1"></script>

    I think 044ddff76151606c2d97ada9daa3ea45 (the ak parameter) is an access key. I'm not sure whether it can be reused across requests. Judging by its length (32 hex characters), it looks like a random hash, which could pose problems if it changes from request to request. It may also be worth checking the API documentation, if there is any.
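
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the function names are hypothetical, the regular expressions assume the exact markup quoted above, the `appid` and `ak` values are supplied by the caller because their origin is not established here, and the layout of the JSON response is unknown, so the last helper simply walks whatever object it is given and collects strings that look like m3u8 URLs. No network request is made.

```javascript
// Read contentid and spaceid from the yvpub-player embed tag (markup assumed
// to match the snippet quoted in this answer).
function extractPlayerParams(html) {
  const tag = html.match(/<script\b[^>]*class="yvpub-player"[^>]*>/);
  if (!tag) return null;
  const src = tag[0].match(/\bsrc="([^"]+)"/);
  if (!src) return null;
  // The attribute value is HTML-escaped, so turn &amp; back into &.
  const url = new URL(src[1].replace(/&amp;/g, "&"));
  return {
    contentId: url.searchParams.get("contentid"),
    spaceId: url.searchParams.get("spaceid"),
  };
}

// Rebuild a request URL with the same parameters as the observed one.
// appId and accessKey are caller-supplied; whether they can be reused is unknown.
function buildContentApiUrl({ contentId, spaceId, appId, accessKey, domain }) {
  const url = new URL(
    "https://feapi-yvpub.yahooapis.jp/v1/content/" + encodeURIComponent(contentId)
  );
  const params = {
    appid: appId, output: "json", space_id: spaceId, domain, ak: accessKey,
    device_type: "1100", thumb_width: "1204", thumb_height: "676",
    thumb_priority: "l", thumb_bd: "0",
  };
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

// Walk an arbitrary parsed JSON value and collect strings that look like
// m3u8 URLs, since the response structure is not documented here.
function collectM3u8Urls(value, found = []) {
  if (typeof value === "string") {
    if (/\.m3u8(\?|$)/.test(value)) found.push(value);
  } else if (value && typeof value === "object") {
    for (const child of Object.values(value)) collectM3u8Urls(child, found);
  }
  return found;
}

const sampleTag =
  '<script type="text/javascript" class="yvpub-player" ' +
  'src="https://s.yimg.jp/images/yvpub/player/js/embed.js?contentid=1576087' +
  '&amp;width=602&amp;height=338&amp;propertyname=jp_news&amp;spaceid=2078710316' +
  '&amp;repeat=0&amp;recommend=0&amp;autostart=1" data-composed="1"></script>';

console.log(extractPlayerParams(sampleTag));
// prints: { contentId: '1576087', spaceId: '2078710316' }
```

The resulting URL could then be fetched and the parsed body passed to `collectM3u8Urls`. If the access key turns out to be generated per page load, it would have to be located in the page or player script first.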