javascript, php, curl, preg-match

How to preg_match this text/javascript block in a loaded HTML page


When I view the source of an HTML page, I see this inside a text/javascript script tag:

playlist = [{
    title: "",
    thumnail: "//example.com/folder/c9cc7f89fe5c168551bca2111d479a3e_1515576875.jpg",
    source: "https://examp.com/360/HX62.mp4?authen=exp=1517246689~acl=/82vL3DDTye4/*~hmac=977cefd9de63a29fde25c856e0fdfd2f",
    sourceLevel: [
        {
            source: "https://examp.com/360/HX62.mp4?authen=exp=1517246689~acl=/82vL3DDTye4/*~hmac=977cefd9de63a29fde25c856e0fdfd2f",
            label: '360p'
        },
        {
            source: "https://examp.com/480/HX62.mp4?authen=exp=1517246689~acl=/SuCa7NnGEhM/*~hmac=80bc89a07b1f4ed87d584a89c623e946",
            label: '480p'
        },
        {
            source: "https://examp.com/720/HX62.mp4?authen=exp=1517246689~acl=/SuCa7NnGEhM/*~hmac=80bc89a07b1f4ed87d584a89c623e946",
            label: '720p'
        },
    ],
}];

I want to extract the source and label strings, so I wrote this code:

$page = curl ('https://example.com/video-details.html')
preg_match ('#sourceLevel:[{source: "(.*?)",label: \'360p\'},{source: "(.*?)",label: \'480p\'},{source: "(.*?)",label: \'720\'}#', $page, $source);
$data360 = $source[1];
$data480 = $source[2];
$data720 = $source[3];
echo $data360. '<br/>' .$data480. '<br/>' .$data720. '<br/>';

I know it is probably wrong somewhere, because I'm new to PHP. I hope someone can help me correct my code. Many thanks!


Solution

  • You need to:

    • escape braces and square brackets in your regular expression, as they have special meanings in regexes;
    • escape the single quotes inside the string literal, since you chose the single quote as its delimiter (which you corrected after I wrote this);
    • allow for the whitespace that can appear between tokens (e.g. before and after {) in your page string.
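
To illustrate the points above, here is a minimal sketch. The $sample string is a made-up, shortened fragment (not the real page), and both patterns are trimmed-down illustrations rather than the final solution. It shows that the unescaped "[" makes the pattern fail to compile, while escaping the brackets and braces and adding \s* lets it match.

```php
<?php
// Hypothetical, shortened fragment of the page (assumption for illustration).
$sample = <<<'JS'
sourceLevel: [
    {
        source: "https://examp.com/360/HX62.mp4",
        label: '360p'
    },
JS;

// Unescaped: "[" opens a character class that is never closed, so the
// pattern does not compile and preg_match() returns false.
// The @ only suppresses the compilation warning for this demonstration.
$bad = @preg_match('#sourceLevel:[{source: "(.*?)"#', $sample, $m);

// Escaped "[" and "{", with \s* wherever whitespace or line breaks may occur.
$good = preg_match(
    '#sourceLevel\s*:\s*\[\s*\{\s*source\s*:\s*"(.*?)"#',
    $sample,
    $m
);

var_dump($bad);   // bool(false)
var_dump($good);  // int(1)
echo $m[1], "\n"; // https://examp.com/360/HX62.mp4
```

Checking the return value against false is a cheap way to notice a pattern that did not compile, as opposed to one that simply found nothing (which returns 0).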

    I would also suggest to match the source/labels each as separate matches, so that when there are not exactly three, you will still have them all.

    Here is the suggested code:

    preg_match_all('~\{\s*source\s*:\s*"(.*?)"\s*,\s*label\s*:\s*\'(.*?)\'\s*\}~', 
                   $page, $sources);
    
    $sources = array_combine($sources[2], $sources[1]);
    

    This will provide the $sources variable as an associative array, keyed by the labels:

    [
        "360p" => "https://examp.com/360/HX62.mp4?authen=exp=1517246689~acl=/82vL3DDTye4/*~hmac=977cefd9de63a29fde25c856e0fdfd2f",
        "480p" => "https://examp.com/480/HX62.mp4?authen=exp=1517246689~acl=/SuCa7NnGEhM/*~hmac=80bc89a07b1f4ed87d584a89c623e946",
        "720p" => "https://examp.com/720/HX62.mp4?authen=exp=1517246689~acl=/SuCa7NnGEhM/*~hmac=80bc89a07b1f4ed87d584a89c623e946"
    ]
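
For reference, a self-contained sketch of the whole flow might look like the following. The $page content is the snippet from the question with the URLs shortened for readability (an assumption; the real ones carry the authen query string), and the $found, $matches and $data720 names are illustrative only.

```php
<?php
// Stand-in for the downloaded page; URLs shortened for readability.
$page = <<<'JS'
playlist = [{
    title: "",
    source: "https://examp.com/360/HX62.mp4",
    sourceLevel: [
        {
            source: "https://examp.com/360/HX62.mp4",
            label: '360p'
        },
        {
            source: "https://examp.com/480/HX62.mp4",
            label: '480p'
        },
        {
            source: "https://examp.com/720/HX62.mp4",
            label: '720p'
        },
    ],
}];
JS;

// Same pattern as suggested above: one match per { source, label } pair.
$found = preg_match_all(
    '~\{\s*source\s*:\s*"(.*?)"\s*,\s*label\s*:\s*\'(.*?)\'\s*\}~',
    $page,
    $matches
);

// preg_match_all() returns the number of matches (0 for none) or false on
// error, so only combine the groups when something was found.
// $matches[2] holds the labels (keys), $matches[1] the URLs (values).
$sources = $found ? array_combine($matches[2], $matches[1]) : [];

// Print every quality that was found, however many there are.
foreach ($sources as $label => $url) {
    echo $label, ': ', $url, "<br/>\n";
}

// A single quality, with null as a fallback when that label is absent.
$data720 = $sources['720p'] ?? null;
```

Because the array is keyed by label, the code does not depend on the page listing exactly three qualities or on their order.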