
Cheerio: find text in a script tag


I want to extract the JavaScript inside a script tag.

This is the script tag:

<script>
  $(document).ready(function(){

    $("#div1").click(function(){
      $("#divcontent").load("ajax.content.php?p=0&cat=1");
    });

    $("#div2").click(function(){
      $("#divcontent").load("ajax.content.php?p=1&cat=1");
    });

  });
</script>

I have an array of ids like ['div1', 'div2'], and I need to extract the URL loaded by each one's click handler. So if I call a function:

getUrlOf('div1');

it should return ajax.content.php?p=0&cat=1


Solution

  • With Cheerio, it is very easy to get the text of the script tag:

    const cheerio = require('cheerio');
    const $ = cheerio.load("the HTML of the webpage you are scraping");
    
    // If there's only one <script>
    console.log($('script').text());
    
    // If there are multiple scripts: elem is a raw node, so wrap it with $() before calling .text()
    $('script').each((idx, elem) => console.log($(elem).text()));
    

    From here, you're really just asking "how do I parse a generic block of JavaScript and extract a list of links". I agree with Patrick in the comments above: you probably shouldn't. Can you craft a regex that will let you find each link in the script and deduce the page it links to? Yes. But if anything about this page changes, your script will very likely break: the author of the page might switch to inline <a> tags, refactor the code, use live events, etc.

    Just be aware that relying on the exact contents of this script tag will make your application very brittle -- even more brittle than page scraping generally is.

    Here's an example of a loose but effective regex:

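    let html = "incoming html";
    // The g flag is required: without it, exec() always returns the first match and the loop never ends
    let regex = /\$\("(#.+?)"\)\.click(?:.|\n)+?\.load\("(.+?)"/g;
    let match;
    
    // Each exec() call resumes after the previous match and returns null when none are left
    while ((match = regex.exec(html)) !== null) {
        console.log(match[1] + ': ' + match[2]);
    }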
    

    In case you are new to regex: this expression contains two capture groups, in parentheses (the first is the selector, i.e. the div id with its leading #; the second is the URL passed to .load()), as well as a non-capturing group in the middle, which exists only so that the match can continue across line breaks. I say it's "loose" because the match it is looking for looks like this:

    • $("***").click***ignored chars***.load("***"

    So, depending on how much JavaScript there is on the page and how closely other code resembles this pattern, you might have to tighten it up to avoid false positives.
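
To tie this back to the getUrlOf('div1') call in the question, here is a minimal sketch, assuming the script text has already been pulled out of the page with Cheerio. The two-argument signature, the escapeRegExp helper, and the scriptText variable are illustrative assumptions, not part of the original page or of Cheerio's API (the question's getUrlOf took only the id). It is the same loose pattern as above, narrowed to one specific id, so the same brittleness warning applies.

```javascript
// Hypothetical helper: escape regex metacharacters so an id is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return the URL passed to .load() inside the click
// handler registered on #<id>, or null if no such handler is found.
function getUrlOf(scriptText, id) {
  const pattern = new RegExp(
    '\\$\\("#' + escapeRegExp(id) + '"\\)\\.click[\\s\\S]+?\\.load\\("(.+?)"'
  );
  const match = pattern.exec(scriptText);
  return match ? match[1] : null;
}

// Stand-in for the text extracted from the <script> tag in the question.
const scriptText = `
  $(document).ready(function(){
    $("#div1").click(function(){
      $("#divcontent").load("ajax.content.php?p=0&cat=1");
    });
    $("#div2").click(function(){
      $("#divcontent").load("ajax.content.php?p=1&cat=1");
    });
  });
`;

console.log(getUrlOf(scriptText, 'div1')); // ajax.content.php?p=0&cat=1
```

Building the pattern per id means no g flag or loop is needed, and an id that has no handler in the script simply yields null.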