javascript, node.js, json, web-scraping

Trying to extract a JSON string out of JavaScript in HTML


I'm using Node.js to do web scraping.

I have a complex HTML string. It contains a number of HTML tags and a few JavaScript blocks. Each JavaScript block contains JavaScript functions with a few parameters, and each parameter is a JSON string. I'm only interested in those JSON strings. What's the best way to extract them?

Sample code:

<html>
    <header>...</header>
    <script>function1(param1:[{a:"V1"},{b:"v2"}],param2:[{c:"v3"},{d:"v4"}])</script> 
    <script>...</script>
    <body>...</body>
</html>

Appreciate your advice.


Solution

  • First, parse the HTML with cheerio. That lets you pull the JavaScript source out of the <script> tags reliably, using jQuery-style syntax such as $('script').text() (you will presumably want to loop over the script tags one at a time, since that call returns the combined text of every matched element). Once you have the JavaScript itself, parse it with esprima, find all the function calls in the resulting syntax tree, and pick out the arguments that are literals, or arrays and objects built from literals. These two libraries will be far more robust than hacking something together with regular expressions. Start small, post a code snippet, and come back for help if you get stuck.
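To make that concrete, here is a minimal sketch of the two steps, not a definitive implementation. It assumes cheerio and esprima are installed from npm, and that the real page contains syntactically valid calls such as function1([{a:"V1"},{b:"v2"}],[{c:"v3"},{d:"v4"}]); the sample in the question, with its param1: labels, would not parse as JavaScript. The helper names toValue, collectCalls and extractFromHtml are made up for illustration.

```javascript
// Sketch only: helper names are hypothetical, not part of cheerio or esprima.

// Turn an ESTree node built from literals, arrays and plain objects into the
// matching JavaScript value. Any other node type yields undefined.
function toValue(node) {
  switch (node.type) {
    case 'Literal':
      return node.value;
    case 'ArrayExpression':
      // Holes in an array literal appear as null elements in the AST.
      return node.elements.map((el) => (el ? toValue(el) : null));
    case 'ObjectExpression': {
      const out = {};
      for (const prop of node.properties) {
        // Give up on spread properties and computed keys.
        if (prop.type !== 'Property' || prop.computed) return undefined;
        const key = prop.key.type === 'Identifier'
          ? prop.key.name
          : String(prop.key.value);
        out[key] = toValue(prop.value);
      }
      return out;
    }
    default:
      return undefined;
  }
}

// Walk the whole AST and record every call to a plain function name,
// together with its arguments converted by toValue.
function collectCalls(node, found = []) {
  if (!node || typeof node.type !== 'string') return found;
  if (node.type === 'CallExpression' && node.callee.type === 'Identifier') {
    found.push({ name: node.callee.name, args: node.arguments.map(toValue) });
  }
  for (const key of Object.keys(node)) {
    const child = node[key];
    if (Array.isArray(child)) child.forEach((c) => collectCalls(c, found));
    else if (child && typeof child === 'object') collectCalls(child, found);
  }
  return found;
}

// Run the two steps over an HTML string.
function extractFromHtml(html) {
  // Required lazily so the AST helpers above can be used on their own.
  const cheerio = require('cheerio');
  const esprima = require('esprima');
  const $ = cheerio.load(html);
  const results = [];
  $('script').each((i, el) => {
    const source = $(el).html();
    if (!source || !source.trim()) return; // empty or src-only script tag
    try {
      results.push(...collectCalls(esprima.parseScript(source)));
    } catch (err) {
      // Contents are not valid JavaScript (e.g. JSON-LD or a template): skip.
    }
  });
  return results;
}
```

Under the assumption above, calling extractFromHtml on a page containing the valid form of the call should give one entry named function1 whose args are the two arrays, and JSON.stringify on each argument gets you back to a JSON string. Arguments that are not plain data (variables, nested calls) come back as undefined, so filter those out if you only care about the data.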