I have an HTML table as below:
<table border='1' width='100%'>
<tr>
<td>
<table border='1' width='100%'>
<tr>
<th>
<p>Title2</p>
</th>
</tr>
<tr>
<th>
<div>Content2</div>
</th>
</tr>
</table>
</td>
<td>
<table border='1' width='100%'>
<tr>
<th>
<p>Hello Title1</p>
</th>
</tr>
<tr>
<th>
<div>Hello content 1</div>
</th>
</tr>
</table>
</td>
</tr>
</table>
I am making a Windows application that reads all the titles and shows them in a list. When the user selects a title from the list, it needs to show the content of the corresponding table.
Q: How can I read all the titles and display them without using HTMLAgilityPack or any other parser?
So far I have done this:
WebClient wc = new WebClient();
System.IO.Stream stream = wc.OpenRead(strFilePath);
StreamReader sReader = new StreamReader(stream);
string strTables = sReader.ReadToEnd();
if (!string.IsNullOrEmpty(strTables))
{
//code to parse html tables
}
As you may have noticed, the title is inside the <p> element and the content is inside the <div> element. Any ideas?
Even though parsing HTML with regex is not best practice, it is an option:
Patterns:
<p>(.*?)</p>
<div>(.*?)</div>
Example:
// Requires: using System.IO; using System.Net; using System.Text.RegularExpressions;
using (WebClient wc = new WebClient())
using (Stream stream = wc.OpenRead(strFilePath))
using (StreamReader sReader = new StreamReader(stream))
{
    string strTables = sReader.ReadToEnd();
    if (!string.IsNullOrEmpty(strTables))
    {
        // The lazy capture group (.*?) grabs just the text between the tags,
        // so the <p> and </p> do not have to be stripped afterwards.
        MatchCollection pMatches = Regex.Matches(strTables, "<p>(.*?)</p>", RegexOptions.Singleline);
        foreach (Match pMatch in pMatches)
        {
            string title = pMatch.Groups[1].Value;
        }
    }
}
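To show the content for a selected title, each title has to stay paired with its content. Here is a minimal sketch of one way to do that, assuming every <p> title is followed by its own <div> before the next <p> appears, as in the markup from the question. The `TableTitles` class and its `Parse` method are hypothetical names made up for this illustration, not part of any library.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TableTitles
{
    // Hypothetical helper: pairs each <p> title with the first <div>
    // content that follows it. Assumes every title has a matching <div>.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var result = new List<KeyValuePair<string, string>>();

        // Singleline lets '.' cross line breaks; the lazy quantifiers
        // stop at the first closing tag rather than the last one.
        var pattern = new Regex(
            @"<p>(?<title>.*?)</p>.*?<div>(?<content>.*?)</div>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        foreach (Match m in pattern.Matches(html))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["title"].Value.Trim(),
                m.Groups["content"].Value.Trim()));
        }
        return result;
    }
}
```

Calling `TableTitles.Parse(strTables)` would give you a list of title/content pairs in document order. You could add the keys to your list control and use the selected index to look up the matching value. If a table ever has a title without a <div>, this pattern would pick up the next table's content instead, so it only holds for markup shaped like the sample.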