Requirement: I want to get the names of all products from a web page.
Problem Statement:
After the page has loaded in full, I apply the regex below (which works well) to get the names of all products on the page. My problem is that the output is still as if 'Load More' had never been clicked, i.e. only the product names from the first page are displayed. I need to tweak DownloadString so that $content below holds the full page source (after the page has loaded in full).
Code below: This web page has a 'Load More' button at the end. I run the following script to click the 'Load More' button and keep clicking it until the full page is displayed. This part of the problem was resolved in another SO question and is working fine.
$ie = New-Object -COMObject InternetExplorer.Application
$ie.visible = $true
$ie.Navigate('https://www.xxx.com/search/all?name=sporanox')
while($true)
{
    # Wait while IE is busy OR the document is not yet complete (ReadyState 4)
    while ($ie.Busy -or $ie.ReadyState -ne 4){ Start-Sleep -Milliseconds 100 }
    try {
        $link = $ie.Document.get_links() | Where-Object {$_.innerText -eq 'Load More'}
        if ($null -ne $link)
        {
            # A zero-height link is hidden, meaning there is nothing left to load
            if ($link.clientHeight -eq 0)
            {
                break
            }
            $link.click()
        }
        else
        {
            break
        }
    }
    catch
    {
        break
    }
}
$regex = [RegEx]'"item-name prdctNm">(.*?)</a>'
$url = 'https://www.xxx.com/search/all?name=sporanox'
$wc = New-Object System.Net.WebClient
$content = $wc.DownloadString($url)
$regex.Matches($content) | ForEach-Object { $_.Groups[1].Value }
Instead of requesting the page again (which would be a second, separate session with no connection to what you did previously in the browser), read the information from the document's outerHTML:
$ie.Document.body.outerHTML
which contains data like this
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</A>
<DIV class=text-small>2 ml</DIV>
<DIV class="item-manufacturer visible-xs">Cipla Limited</DIV></DIV>
<DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Cipla Limited</SPAN></DIV>
<DIV class="col-sm-2 col-xs-4 text-right">
<DIV class=item-actual>Rs. 6</DIV>
<DIV class=item-price>Rs. 6</DIV></DIV></DIV></LI>
<LI class="list-item item js-drug">
<DIV class=row>
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</A>
<DIV class=text-small>28 Tablets</DIV>
<DIV class="item-manufacturer visible-xs">Intas Laboratories Pvt Ltd</DIV></DIV>
<DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Intas Laboratories Pvt Ltd</SPAN></DIV>
<DIV class="col-sm-2 col-xs-4 text-right">
<DIV class=item-actual>Rs. 5.72</DIV>
<DIV class=item-price>Rs. 5.72</DIV></DIV></DIV></LI>
<LI class="list-item item js-drug">
Placing that line after the while loop should get you what you need. I will try to help with the parsing as well; I would think this is the data you are looking for.
There has to be a better way to parse this, but I am not yet well versed in HTML/XML parsing. I needed to change your regex to match the text returned, but both of these yielded useful results.
$regex = 'item-name.*?>(.*?)</A>'
$ie.Document.body.outerHTML | Select-String -Pattern $regex -AllMatches | Foreach {$_.Matches} | ForEach-Object {$_.Value}
and
$drugs = $ie.Document.body.outerHTML -split "`r`n" | ForEach-Object{
    If($_ -match $regex){
        $Matches[1]
    }
}
The latter performed better, with just the drug names stored as a string array in $drugs. As of when I write this, it returned 528 entries.
...truncated output...
Spentron
Spencitron
Speucid Tab
Spasnil Drop (15ml)
Sparmex Tab
Spye Tab
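
For reference, the capture-group approach can be tried without a browser session. The following is only a minimal sketch: $sampleHtml is a hypothetical stand-in for $ie.Document.body.outerHTML, trimmed from the markup shown earlier, and the pattern is a slightly tightened variant of the regex above (an assumption on my part, not something the live page has been checked against). It uses [^>]* so the match cannot run past the opening A tag.

```powershell
# Hypothetical sample in the shape IE returns (upper-case tags, unquoted class attribute).
# In the real script this would be $ie.Document.body.outerHTML, read after the while loop.
$sampleHtml = @'
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</A>
<DIV class=text-small>2 ml</DIV>
<DIV class="item-manufacturer visible-xs">Cipla Limited</DIV></DIV>
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</A>
<DIV class=text-small>28 Tablets</DIV>
'@

# [^>]* stays inside the opening <A ...> tag; (.*?) captures only the link text.
$pattern = 'item-name[^>]*>(.*?)</A>'

# Groups[1] holds the captured name, whereas .Value would return the whole match.
$drugs = [regex]::Matches($sampleHtml, $pattern) | ForEach-Object { $_.Groups[1].Value }
$drugs
```

Against this sample, $drugs ends up holding the two product names, Spasmonil (20mg) and Sprintas (75mg). Running the pattern over the whole string also avoids depending on where the line breaks fall in outerHTML.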