
HTML Agility Pack <pre> tag


I'm trying to scrape a website that has a "pre" tag using the HTML Agility Pack in C#. I can find plenty of "table tr td" examples but cannot find any "pre" examples. Here is my code, with a sample of the preformatted text from the "pre" tag included inline as comments.

    private void PreformattedTextButton_Click(object sender, EventArgs e)
    {
        var url = @"http://www.thepredictiontracker.com/basepred.php";
        var data = new MyWebClient().DownloadString(url);
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(data);

        //            m _        a _        e d     d d     d d     d l     n
        //e       h d       v r    1     2     3     4     5     6     2     s

        //  BAL D.BUNDY TAM C.ARCHER     7.5  7.48  8.08  7.00  5.58  4.70.     .    6.46
        //  CIN H.BAILEY ATL S.NEWCOMB    9.0  9.72 10.08 10.00 11.62 11.51.     .   10.73

        foreach (HtmlNode pre in doc.DocumentNode.SelectNodes("//pre"))
        {
            textBox1.Text += pre.InnerText + System.Environment.NewLine;
        }
    }

I want to capture the lines that look like the 3rd and 4th comment lines, ignoring the preceding lines.

The foreach loop runs, but pre.InnerText.Length is 1642, which is the length of the entire preformatted block. I want to capture individual lines of data, e.g. lines 3 and 4.


Solution

  • By definition, a <pre> tag holds preformatted text, so HTML Agility Pack returns the whole block as a single string and you need to parse the InnerText property yourself. The sample you gave above is consistently formatted, so split InnerText into a collection of lines, then use a Regex to keep only the lines you want. Tested and working code example:

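    using System;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;

    var url = @"http://www.thepredictiontracker.com/basepred.php";
    HtmlDocument doc = new HtmlWeb().Load(url);
    // A data line starts with: team code, pitcher (initial.surname), team code.
    var regexMatch = new Regex(
        @"^\s*[A-Z]{3}\s+[A-Z]\.[A-Z]+\s+[A-Z]{3}",
        RegexOptions.Compiled
    );
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var preNodes = doc.DocumentNode.SelectNodes("//pre");
    if (preNodes != null)
    {
        foreach (HtmlNode pre in preNodes)
        {
            // Split on both CR and LF; blank entries simply fail the regex.
            foreach (var line in pre.InnerText.Split(new char[] { '\r', '\n' }))
            {
                if (regexMatch.IsMatch(line))
                {
                    Console.WriteLine(line.Trim());
                }
            }
        }
    }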
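
If you also need the individual columns, the matched lines can be broken into fields. The following is only a sketch under some assumptions: PreLineFilter and ExtractRows are hypothetical names (not part of HTML Agility Pack), the input is the plain string you would get from pre.InnerText, and the columns are assumed to be separated by whitespace. The values are left as strings because the sample rows contain "." placeholders that would not parse as numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HTML Agility Pack. It works on the
// plain string that pre.InnerText would return, so it needs no network.
public static class PreLineFilter
{
    // Same pattern as in the answer: team code, pitcher written as
    // initial.surname, then a second team code.
    private static readonly Regex DataLine = new Regex(
        @"^\s*[A-Z]{3}\s+[A-Z]\.[A-Z]+\s+[A-Z]{3}",
        RegexOptions.Compiled);

    // Returns each matching line split into whitespace-separated fields.
    public static List<string[]> ExtractRows(string preText)
    {
        var rows = new List<string[]>();
        var lines = preText.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var line in lines)
        {
            if (DataLine.IsMatch(line))
            {
                // A null separator array makes Split use any whitespace.
                rows.Add(line.Split(
                    (char[])null, StringSplitOptions.RemoveEmptyEntries));
            }
        }
        return rows;
    }
}

public static class Demo
{
    public static void Main()
    {
        // The two data rows are copied from the question; the first line
        // is a made-up stand-in for the header text that should be skipped.
        string sample =
            "header text that should be ignored\n" +
            "  BAL D.BUNDY TAM C.ARCHER     7.5  7.48  8.08  7.00  5.58  4.70.     .    6.46\n" +
            "  CIN H.BAILEY ATL S.NEWCOMB    9.0  9.72 10.08 10.00 11.62 11.51.     .   10.73\n";

        foreach (var fields in PreLineFilter.ExtractRows(sample))
        {
            // fields[0..3] are the two team codes and two pitcher names;
            // the remaining entries are the numeric columns as strings.
            Console.WriteLine(
                fields[0] + " " + fields[1] + " vs " + fields[2] + " " + fields[3]);
        }
    }
}
```

In the real scraper you would pass pre.InnerText to ExtractRows instead of the sample string. If the page turns out to use fixed-width columns rather than whitespace-separated ones, slicing each line with Substring at known offsets may be more reliable than splitting.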