
Parse C# HTML String WITHOUT html parsers like AgilityPack


I have an HTML Table as below:

<table border='1' width='100%'>
<tr>
<td>
<table border='1' width='100%'>
<tr>
    <th>
        <p>Title2</p>
    </th>
</tr>
<tr>
    <th>
        <div>Content2</div>
    </th>
</tr>
</table>
</td>

<td>
<table border='1' width='100%'>
<tr>
    <th>
        <p>Hello Title1</p>
    </th>
</tr>
<tr>
    <th>
        <div>Hello content 1</div>
    </th>
</tr>
</table>
</td>
</tr>
</table>

I am making a Windows application that reads all the titles and shows them in a list. When the user selects a title from the list, the application needs to show the content of the selected table.

Q: How can I read all the titles and display them without using HTMLAgilityPack or any other parsers?

So far I have done this:

        WebClient wc = new WebClient();
        System.IO.Stream stream = wc.OpenRead(strFilePath);
        StreamReader sReader = new StreamReader(stream);
        string strTables = sReader.ReadToEnd();
        if (!string.IsNullOrEmpty(strTables))
        { 
            //code to parse html tables
        }

As you may have noticed, the title is inside the <p> element and the content is inside the <div> element. Any ideas?


Solution

  • Even though parsing HTML with regex is not best practice, it is an option:

    Patterns:

    <p>.*</p>
    <div>.*</div>
    

    Example:

        WebClient wc = new WebClient();
        System.IO.Stream stream = wc.OpenRead(strFilePath);
        StreamReader sReader = new StreamReader(stream);
        string strTables = sReader.ReadToEnd();
        if (!string.IsNullOrEmpty(strTables))
        { 
            // I'm not a regex master but I'm sure there is a way to get the title without the <p> elements.
            var pMatches = Regex.Matches(strTables, "<p>.*</p>");
            foreach (Match pMatch in pMatches)
            {
               // Strip the surrounding tags so only the title text remains.
               string title = pMatch.Value.Replace("<p>", string.Empty).Replace("</p>", string.Empty);
            }
        }
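
    One way to get the title without the <p> tags, and to pair it with its content, is a capturing group. The following is only a sketch: the TableParser class and the ExtractTitleContentPairs method are hypothetical names, and the pattern assumes that every title is in a plain <p> with no attributes, that each title is followed by its content in a plain <div>, and that neither element is nested inside another of the same kind. Markup that breaks those assumptions will likely need a real parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answer.
public static class TableParser
{
    // Returns a title -> content map built from <p>title</p> ... <div>content</div> pairs.
    // Assumption: titles are unique; a repeated title overwrites the earlier entry.
    public static Dictionary<string, string> ExtractTitleContentPairs(string html)
    {
        var pairs = new Dictionary<string, string>();

        // (.*?) is lazy, so each group stops at the first closing tag it meets.
        // Singleline lets '.' match line breaks between the <p> and the <div>.
        var pattern = new Regex(
            @"<p>(.*?)</p>.*?<div>(.*?)</div>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        foreach (Match m in pattern.Matches(html))
        {
            // Groups[1] is the title text, Groups[2] is the content text.
            pairs[m.Groups[1].Value.Trim()] = m.Groups[2].Value.Trim();
        }

        return pairs;
    }
}
```

    Calling TableParser.ExtractTitleContentPairs(strTables) on the table from the question should give two entries, "Title2" and "Hello Title1". The keys can fill the list, and the selected key can be used to look up the content to display.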