javascript web-scraping cheerio

JavaScript - How to find and get specific values from a <script> tag in HTML


I am scraping a website and trying to get specific values from a <script> tag within the HTML page. The page has many other <script> tags. The specific script I am targeting holds all the images I need to scrape.

I am not able to scrape the images directly using Cheerio because they are not available on the main HTML page unless I click on the main image to see all other images.

What I need is something like this:

Find the <script> tag that contains the key someImages, then, for each key named large inside it, return that key's value.

I have created an example below to explain my problem.

Your help is very much appreciated.

Thank you very much.

<body>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    .
    .
    .
    
    <script type="text/javascript">
        var data = {
            'someImages': {
                'initial': [
                    {
                        "hiRes": "https://somewebsite/images/imageName1.jpg",
                        "thumb": "https://somewebsite/images/imageName1.jpg",
                        "large": "https://somewebsite/images/imageName1.jpg", // I would like to be able to get the value of large from this line
                        "main": {
                            "https://somewebsite/images/imageName1.jpg": [1654],
                            "https://somewebsite/images/imageName1.jpg": [3416],
                            "https://somewebsite/images/imageName1.jpg": [7560]
                        }
                    },

                    {
                        "hiRes": "https://somewebsite/images/imageName2.jpg",
                        "thumb": "https://somewebsite/images/imageName2.jpg",
                        "large": "https://somewebsite/images/imageName2.jpg", // I would like to be able to get the value of large from this line
                        "main": {
                            "https://somewebsite/images/imageName2.jpg": [2234],
                            "https://somewebsite/images/imageName2.jpg": [3616],
                            "https://somewebsite/images/imageName2.jpg": [7849]
                        }
                    },

                    {
                        "hiRes": "https://somewebsite/images/imageName3.jpg",
                        "thumb": "https://somewebsite/images/imageName3.jpg",
                        "large": "https://somewebsite/images/imageName3.jpg", // I would like to be able to get the value of large from this line
                        "main": {
                            "https://somewebsite/images/imageName3.jpg": [2344],
                            "https://somewebsite/images/imageName3.jpg": [3556],
                            "https://somewebsite/images/imageName3.jpg": [7490]
                        }
                    },
                ]
            }
        };
    </script>
    
    
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    .
    .
    .

</body>

Solution

  • A simple regex should do the trick. Use a capture group to capture the URL that follows each "large" key.

    /"large": ?"(.+?)",/g
    

    You can test it in Regexpal if you want.
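As a minimal sketch of how that regex might be applied, the snippet below first isolates the script body that mentions someImages and then collects every "large" value with String.prototype.matchAll. The html string is a shortened, hypothetical stand-in for the fetched page source, and extractLargeUrls is an illustrative helper name, not part of Cheerio or any other library. The script-matching regex is a simplification that assumes no literal closing script tag appears inside the script body.

```javascript
// Hypothetical, shortened stand-in for the page source you have already fetched.
const html = `
<body>
    <script type="text/javascript"> var other = 1; </script>
    <script type="text/javascript">
        var data = {
            'someImages': {
                'initial': [
                    {
                        "hiRes": "https://somewebsite/images/imageName1.jpg",
                        "large": "https://somewebsite/images/imageName1.jpg",
                        "main": {}
                    },
                    {
                        "hiRes": "https://somewebsite/images/imageName2.jpg",
                        "large": "https://somewebsite/images/imageName2.jpg",
                        "main": {}
                    }
                ]
            }
        };
    </script>
</body>`;

// extractLargeUrls is an illustrative helper name (an assumption, not a library function).
function extractLargeUrls(source) {
    // Step 1: collect the text inside every <script> tag.
    const scriptBodies = [...source.matchAll(/<script\b[^>]*>([\s\S]*?)<\/script>/gi)]
        .map((match) => match[1]);

    // Step 2: keep the first script whose text mentions the someImages key.
    const target = scriptBodies.find((body) => body.includes('someImages'));
    if (!target) {
        return [];
    }

    // Step 3: capture the quoted value after each "large" key.
    // \s* tolerates any whitespace after the colon, and the trailing comma
    // from the original regex is dropped so a last-in-object key also matches.
    return [...target.matchAll(/"large":\s*"(.+?)"/g)].map((match) => match[1]);
}

const largeUrls = extractLargeUrls(html);
console.log(largeUrls);
```

If you already load the page with Cheerio, the text of each script tag could be passed to step 3 directly instead of being located with the first regex. Restricting the search to the one script that contains someImages avoids picking up unrelated "large" keys elsewhere on the page.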