I'm using the following code:
Dim cl As WebClient = New WebClient()
Dim html As String = cl.DownloadString(url)
Dim doc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(html)
Dim table As HtmlNode = doc.DocumentNode.SelectSingleNode("//table[@class='table']")
For Each row As HtmlNode In table.SelectNodes(".//tr")
    Dim inner_text As String = row.InnerHtml.Trim()
Next
My inner_text for each row looks like this, with different years and data:
"<th scope="row">2015<!-- --> RG Journal Impact</th><td>6.33</td>"
Each row has a th element and a td element. I have tried different ways to pull the values, but I can't seem to pull them one after the other by looping over the column collection. How can I pull just the th element and the td element using the correct XPath syntax?
Until I can use better code I'll use standard parsing functions:
Dim hname As String = row.InnerHtml.Trim()
Dim items() As String = hname.Split("</td>")
Dim year As String = items(1).Substring(items(1).IndexOf(">") + 1)
Dim value As String = items(4).Substring(items(4).IndexOf(">") + 1)
If value.ToLower.Contains("available") Then
    value = ""
End If
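You can carry on querying from each row node: call SelectSingleNode on the row with a relative XPath (".//th" and ".//td") and read InnerText instead of InnerHtml: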
Option Infer On
Option Strict On
Imports HtmlAgilityPack
Module Module1
    Sub Main()
        Dim h = "<html><head><title></title></head><body>
<table class=""table"">
<tr><th scope=""row"">2015<!-- --> RG Journal Impact</th><td>6.33</td></tr>
<tr><th scope=""row"">2018 JIR</th><td>9.99</td></tr>
</table>
</body></html>"
        Dim doc = New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(h)
        Dim table = doc.DocumentNode.SelectSingleNode("//table[@class='table']")
        For Each row In table.SelectNodes(".//tr")
            Dim yearData = row.SelectSingleNode(".//th").InnerText.Split(" "c)(0)
            Dim value = row.SelectSingleNode(".//td").InnerText
            Console.WriteLine($"Year: {yearData} Value: {value}")
        Next
        Console.ReadLine()
    End Sub
End Module
Outputs:
Year: 2015 Value: 6.33
Year: 2018 Value: 9.99
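One thing to watch on a live page: in HtmlAgilityPack, SelectNodes and SelectSingleNode return Nothing when the XPath matches nothing, so a row without a th (or a missing table) would throw a NullReferenceException in the loop above. Below is a minimal sketch of a more defensive version. The sample markup and the module name are made up for illustration, and "./th" (direct child) is used instead of ".//th" on the assumption that the cells sit directly under each tr.

```vbnet
Option Infer On
Option Strict On

Imports System
Imports HtmlAgilityPack

Module NullSafeExample
    Sub Main()
        ' Hypothetical sample markup: the second row has no <th> on purpose.
        Dim html = "<table class=""table"">" &
                   "<tr><th scope=""row"">2015 RG Journal Impact</th><td>6.33</td></tr>" &
                   "<tr><td>no header cell</td></tr>" &
                   "</table>"

        Dim doc = New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim rows = doc.DocumentNode.SelectNodes("//table[@class='table']//tr")
        If rows Is Nothing Then
            Console.WriteLine("No rows found.")
            Return
        End If

        For Each row In rows
            ' "./th" and "./td" only match direct children of the current row.
            Dim th = row.SelectSingleNode("./th")
            Dim td = row.SelectSingleNode("./td")

            ' Skip rows that do not have both cells.
            If th Is Nothing OrElse td Is Nothing Then Continue For

            Dim year = th.InnerText.Trim().Split(" "c)(0)
            ' DeEntitize turns entities such as &amp; back into plain characters.
            Dim value = HtmlEntity.DeEntitize(td.InnerText).Trim()
            Console.WriteLine($"Year: {year} Value: {value}")
        Next
    End Sub
End Module
```

With this shape, rows that are missing either cell are skipped rather than crashing the loop, and you can still apply your "available" check to value before using it.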