I am scraping a website and trying to get specific values from a script tag within the HTML page. The page has many other script tags. The one I am targeting contains all the images I need to scrape.
I am not able to scrape the images directly using Cheerio because they are not available on the main HTML page unless I click on the main image to see all other images.
What I need is something like this:
find the tag that has the key {someImages}, then for each key with the name {large}, return the value of this key.
I have created an example below to explain my problem.
Your help is very much appreciated.
Thank you very much
<body>
<script type="text/javascript"> ... </script>
<script type="text/javascript"> ... </script>
<script type="text/javascript"> ... </script>
<script type="text/javascript"> ... </script>
.
.
.
<script type="text/javascript">
var data = {
'someImages': {
'initial': [
{
"hiRes": "https://somewebsite/images/imageName1.jpg",
"thumb": "https://somewebsite/images/imageName1.jpg",
"large": "https://somewebsite/images/imageName1.jpg", // I would like to be able to get the value of large from this line
"main": {
"https://somewebsite/images/imageName1.jpg": [1654],
"https://somewebsite/images/imageName1.jpg": [3416],
"https://somewebsite/images/imageName1.jpg": [7560]
}
},
{
"hiRes": "https://somewebsite/images/imageName2.jpg",
"thumb": "https://somewebsite/images/imageName2.jpg",
"large": "https://somewebsite/images/imageName2.jpg", // I would like to be able to get the value of large from this line
"main": {
"https://somewebsite/images/imageName2.jpg": [2234],
"https://somewebsite/images/imageName2.jpg": [3616],
"https://somewebsite/images/imageName2.jpg": [7849]
}
},
{
"hiRes": "https://somewebsite/images/imageName3.jpg",
"thumb": "https://somewebsite/images/imageName3.jpg",
"large": "https://somewebsite/images/imageName3.jpg", // I would like to be able to get the value of large from this line
"main": {
"https://somewebsite/images/imageName3.jpg": [2344],
"https://somewebsite/images/imageName3.jpg": [3556],
"https://somewebsite/images/imageName3.jpg": [7490]
}
},
]
}
};
</script>
<script type="text/javascript"> ... </script>
<script type="text/javascript"> ... </script>
<script type="text/javascript"> ... </script>
<script type="text/javascript"> ... </script>
.
.
.
</body>
A simple regex should do the trick. Use a capture group to capture the URLs.
/"large": ?"(.+?)",/g
You can test it in Regexpal if you want.
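As a rough sketch of how the regex could be applied in Node, assuming you already have the page source as a string (for example, the same HTML you load into Cheerio): the helper name `extractLargeUrls` and the sample markup are made up for illustration. It first keeps only the script blocks that mention `someImages`, then collects the capture group from every match.

```javascript
// Hypothetical helper (not part of Cheerio): returns every "large" URL found
// in the <script> blocks that mention someImages.
function extractLargeUrls(html) {
  // Grab the body of each <script> tag, then keep only the relevant ones.
  const scriptBodies = Array.from(
    html.matchAll(/<script\b[^>]*>([\s\S]*?)<\/script>/gi),
    (m) => m[1]
  ).filter((body) => body.includes("someImages"));

  // Same regex as above; m[1] is the capture group holding the URL.
  const pattern = /"large": ?"(.+?)",/g;
  return scriptBodies.flatMap((body) =>
    Array.from(body.matchAll(pattern), (m) => m[1])
  );
}

// Trimmed-down sample modelled on the page in the question (assumed layout).
const sampleHtml = `
<body>
<script type="text/javascript">var other = {"large": "https://somewebsite/images/ignored.jpg", "x": 1};</script>
<script type="text/javascript">
var data = {
  'someImages': {
    'initial': [
      {
        "thumb": "https://somewebsite/images/imageName1.jpg",
        "large": "https://somewebsite/images/imageName1.jpg",
        "main": {}
      },
      {
        "thumb": "https://somewebsite/images/imageName2.jpg",
        "large": "https://somewebsite/images/imageName2.jpg",
        "main": {}
      }
    ]
  }
};
</script>
</body>`;

console.log(extractLargeUrls(sampleHtml));
```

Note that the regex expects a comma right after the closing quote, so it would miss a `"large"` entry that happens to be the last key in its object.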