PHP DOM: get the src of all scripts on a website


I want to get the src links of all scripts on a website using curl and DOM.

I have this code:

// $dom is a DOMDocument already loaded with the page HTML fetched via curl.
$scripts = $dom->getElementsByTagName('script');

foreach ($scripts as $script) {
    $src = $script->getAttribute('src');
    if ($src) {
        echo $src;
    }
}

This script works perfectly, but what happens if a website has a script tag like this:

<script type="text/javascript">
window._wpemojiSettings = {"source":{"concatemoji":"http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4"}}; ........
</script>

I also need to get the script URL embedded in this tag. How can I do that?


Solution

  • If your first parser comes up empty, I'd add a second pass using a regex, e.g.:

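    $html = file_get_contents("http://somesite.com/");

    // Capture any http(s) URL ending in .js (optionally with a query string) after a <script tag.
    preg_match_all('/<script.*?(http.*?\.js(?:\?.*?)?)"/si', $html, $matches, PREG_PATTERN_ORDER);
    foreach ($matches[1] as $url) {
        // Inline JSON escapes slashes as \/, so unescape them before printing.
        echo str_replace("\\/", "/", $url), PHP_EOL;
    }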
    

    You may have to adjust the regex for different websites, but the code above should give you an idea of what you need.


    DEMO: http://ideone.com/Fwf6Mb


    Regex Explanation:

    <script.*?(http.*?\.js(?:\?.*?)?)"
    ----------------------------------
    
    Match the character string “<script” literally «<script»
    Match any single character «.*?»
       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
    Match the regex below and capture its match into backreference number 1 «(http.*?\.js(?:\?.*?)?)»
       Match the character string “http” literally «http»
       Match any single character «.*?»
          Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
       Match the character “.” literally «\.»
       Match the character string “js” literally «js»
       Match the regular expression below «(?:\?.*?)?»
          Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
          Match the character “?” literally «\?»
          Match any single character «.*?»
             Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
    Match the character “"” literally «"»
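
As a quick check of the pattern explained above, here is a minimal sketch that runs it against the inline script from the question. The sample string is an assumption: it reproduces only the visible part of that script and leaves out the elided remainder. The variable names ($pattern, $urls, $escaped) are illustrative.

```php
<?php
// Sample inline script modelled on the one in the question; the elided
// remainder of the original script is left out.
$html = '<script type="text/javascript">' . "\n"
      . 'window._wpemojiSettings = {"source":{"concatemoji":"http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4"}};' . "\n"
      . '</script>';

// Same pattern as in the answer: s lets "." match line breaks, i ignores case.
$pattern = '/<script.*?(http.*?\.js(?:\?.*?)?)"/si';

$urls = [];
if (preg_match_all($pattern, $html, $matches, PREG_PATTERN_ORDER)) {
    foreach ($matches[1] as $escaped) {
        // Group 1 still holds the JSON-escaped form (\/), so unescape it.
        $urls[] = str_replace('\\/', '/', $escaped);
    }
}

print_r($urls);
```

Without the s flag the lazy «.*?» could not cross the line break between the opening tag and the URL, so this sample would produce no match.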
    

    Regex Tutorial

    http://www.regular-expressions.info/tutorial.html
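
One possible way to combine both approaches is to keep DOM for the src attributes and apply a regex only to the text of inline scripts. The sketch below is not the code from the answer. The function name findScriptUrls, the sample markup, and the stricter URL pattern are all assumptions made for illustration, and the pattern will likely need tuning per site.

```php
<?php
// Hypothetical helper: returns script URLs found in src attributes and,
// for inline scripts, any http(s) URL ending in .js inside the script text.
function findScriptUrls(string $html): array
{
    $dom = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumed pattern: http(s) URL, slashes optionally escaped as \/,
    // ending in .js with an optional query string.
    $pattern = '~https?:(?:\\\\?/){2}[^"\'\s]+?\.js\b(?:\?[^"\'\s]*)?~i';

    $urls = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        $src = $script->getAttribute('src');
        if ($src !== '') {
            $urls[] = $src;
            continue;
        }
        // Inline script: scan its text content for .js URLs.
        if (preg_match_all($pattern, $script->textContent, $m)) {
            foreach ($m[0] as $found) {
                $urls[] = str_replace('\\/', '/', $found);
            }
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($urls));
}

// Sample markup standing in for a page fetched with curl.
$sample = '<html><head><script src="/js/app.js"></script>'
        . '<script type="text/javascript">window._wpemojiSettings = {"source":{"concatemoji":"http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4"}};</script>'
        . '</head><body></body></html>';

print_r(findScriptUrls($sample));
```

Restricting the regex to the text of script elements avoids matching .js URLs that merely appear elsewhere in the page, at the cost of relying on the DOM parser to locate the script tags first.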