I'm trying to parse the main <table> (the last one in the DOM tree) on this website: "https://aips.um.si/PredmetiBP5/Main.asp?Mode=prg&Zavod=77&Jezik=&Nac=1&Nivo=P&Prg=1571&Let=1". I'm using HtmlAgilityPack and writing the code in C# in a WPF application in Visual Studio 17.
Right now I'm using this code:
iso = Encoding.GetEncoding("windows-1250");
web = new HtmlWeb()
{
    AutoDetectEncoding = false,
    OverrideEncoding = iso,
};
//http = https://aips.um.si/PredmetiBP5/Main.asp?Mode=prg&Zavod=77&Jezik=&Nac=1&Nivo=P&Prg=1571&Let=1
string http = formatLetnikLink(l.Attributes["onclick"].Value).ToString();
var htmlProgDoc = web.Load(http);
string s = htmlProgDoc.ParsedText;
htmlProgDoc.ParsedText correctly includes all the rows that are supposed to be in the last table (I added this for debugging, just in case the watch window was broken or something).
I first tried to get all the tables on the page and found that it contains 6 <table></table> elements, even though you visually see only one. After debugging for a couple of hours, I realized that the main table is the last <table> in the DOM tree, and that the parser is not fully parsing all the <tr> tags that the table has. This is the problem: I need all the <tr> tags.
var tables = htmlProgDoc.DocumentNode.SelectNodes("//table");
There are 6 <table></table> elements, as expected, and every one of them is fully parsed, including all its rows and columns, except the last one. For the last one the parser only picks up the first two rows and then appears to append a </table> by itself. I also tried the direct XPath selector copied from Firefox, "/html/body/div/center[2]/font/font/font/table", instead of "//table". It found the correct table, but that table also contained only the first 2 rows.
var theTableINeed = tables.Last();
// theTableINeed is the correct table, but it contains only the first two rows
The HTML on that page is malformed. One possible workaround is to cut the markup of the last table out of the raw page source and parse it as a separate document.
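// url is the page address from the question
var client = new WebClient();
// The question loads the page as windows-1250, so decode the download the same way
client.Encoding = Encoding.GetEncoding("windows-1250");
string html = client.DownloadString(url);
// Cut from the last "<table" to the end of the last "</table>" (8 characters long)
int lastTableOpen = html.LastIndexOf("<table");
int lastTableClose = html.LastIndexOf("</table");
string lastTable = html.Substring(lastTableOpen, lastTableClose - lastTableOpen + 8);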
Then use HtmlAgilityPack:
var table = new HtmlDocument();
table.LoadHtml(lastTable);
foreach (var row in table.DocumentNode.SelectNodes("//table//tr"))
{
    // Print the row's markup via OuterHtml
    Console.WriteLine(row.OuterHtml);
}
However, I don't know whether the markup inside the table itself is also malformed.
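The substring arithmetic in the workaround assumes that both "<table" and "</table" are found, in that order, and in lowercase. A slightly more defensive version could look like the sketch below. The class and method names (LastTableExtractor, ExtractLastTable) are made up for illustration and are not part of HtmlAgilityPack. The sketch also assumes the last table contains no nested tables, since it simply pairs the last opening tag with the last closing tag.

```csharp
using System;

public static class LastTableExtractor
{
    // Hypothetical helper (not from the original posts): returns the markup of the
    // last <table>...</table> block in html, or null when no complete pair is found.
    public static string ExtractLastTable(string html)
    {
        const string closeTag = "</table>";
        if (string.IsNullOrEmpty(html)) return null;

        // Case-insensitive search, because old pages often mix <TABLE> and <table>.
        int open = html.LastIndexOf("<table", StringComparison.OrdinalIgnoreCase);
        int close = html.LastIndexOf(closeTag, StringComparison.OrdinalIgnoreCase);

        // Either tag is missing, or the last closing tag precedes the last opening tag.
        if (open < 0 || close < open) return null;

        return html.Substring(open, close - open + closeTag.Length);
    }

    public static void Main()
    {
        // Made-up sample markup, only to exercise the helper.
        string sample = "<table><tr><td>nav</td></tr></table><p>x</p>" +
                        "<TABLE><tr><td>a</td></tr><tr><td>b</td></tr></TABLE> trailing";
        Console.WriteLine(ExtractLastTable(sample));
        Console.WriteLine(ExtractLastTable("<p>no tables</p>") == null);
    }
}
```

The returned string can then be passed to LoadHtml as in the answer. A null result means the page did not contain a complete table, so the caller should check for it before parsing.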