Getting InnerText ignoring script node by using Html Agility Pack in C#


I have the following page, from which I want to extract a list of proxy servers shown in a table:

http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any

Each row in the table is a ul element. My problem arises when reading the first li element of each ul, the one whose class is "proxy". I want to obtain the IP and port, so I read its InnerText, but because the li element has a script child node, InnerText returns the text of the script node.

Below is an image of the structure of the page:

[Image: structure of the page]

I have tried the code below, using Html Agility Pack and LINQ:

WebClient webClient = new WebClient();
string page = webClient.DownloadString("http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any");

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);

List<List<string>> table = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
            .Descendants("ul")
            .Where(ul => ul.Elements("li").Count() > 1)
            .Select(ul => ul.Elements("li").Select(li =>
                {
                    string result = string.Empty;
                    if (li.HasClass("proxy"))
                    {
                        HtmlNode liTmp = li.Clone();
                        liTmp.RemoveAllChildren();
                        result = liTmp.InnerText.Trim();
                    }
                    else
                    {
                        result = li.InnerText.Trim();
                    }
                    return result;
                }).ToList()).ToList();

I obtain a list in which each item is a list containing the fields (Proxy, País, Tipo, Velocidad, HTTPS/SSL), but the Proxy field is always empty. I am also not getting the "País" and "Ciudad" columns at all.


Solution

  • That is because those values are injected into the DOM by JavaScript after the page loads. The argument passed to Proxy() is a Base64 representation of the value you are looking for.

    In the image you have posted above the value MTQ4LjI0My4zNy4xMDE6NTMyODE= decodes to 148.243.37.101:53281
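
    As a minimal sketch of that decoding step, assuming the payload is plain ASCII text encoded as Base64 (the ProxyDecoder class and its Decode method are illustrative names, not part of the page or of Html Agility Pack):

    ```csharp
    using System;
    using System.Text;

    public static class ProxyDecoder
    {
        // Hypothetical helper: turns the Base64 argument of Proxy('...')
        // back into the "ip:port" text it encodes.
        public static string Decode(string base64)
        {
            byte[] bytes = Convert.FromBase64String(base64);
            return Encoding.ASCII.GetString(bytes);
        }

        public static void Main()
        {
            // Value taken from the screenshot in the question.
            Console.WriteLine(Decode("MTQ4LjI0My4zNy4xMDE6NTMyODE="));
            // prints 148.243.37.101:53281
        }
    }
    ```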

    The raw HTML string you are feeding to Html Agility Pack contains only the Proxy('...') call for the Proxy field, not the decoded address:

        <div class="table-wrap">
            <div class="table">
                <ul>
                    <li class="proxy">
                        <script type="text/javascript">
                            Proxy('MTM4Ljk3LjkyLjI0OTo1MzgxNg==')
                        </script>
                    </li>
                    <li class="https">HTTP</li>
                    <li class="speed">29.5kbit</li>
                    <li class="type">
                        <strong>Elite</strong>
                    </li>
                    <li class="country-city">
                        <div>
                            <span class="country" title="Brazil">
                                <span class="country-code">
                                    <span class="flag br"></span>
                                    <span class="name">BR Brasil</span>
                                </span>
                            </span>
                            <!-- -->
                            <span class="city">
                                <span>Rondon</span>
                            </span>
                        </div>
                    </li>
                </ul>
                <div class="clear"></div>
    

    Using the following code:

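            // Requires: using System; using System.Linq; using System.Net.Http;
            // using System.Text; using System.Text.RegularExpressions; using HtmlAgilityPack;
            HttpClient client = new HttpClient();
            // Blocking on .Result keeps the sample short; prefer await in async code.
            var docResult = client.GetStringAsync("http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any").Result;
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(docResult);
            // Captures the Base64 argument of each Proxy('...') call.
            Regex reg = new Regex(@"Proxy\('(?<value>.*?)'\)", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    
            var stuff = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
                .Descendants("li")
                .Where(x => x.HasClass("proxy"))
                .Select(li => li.InnerText)
                .ToList();
    
            foreach (var item in stuff)
            {
                var match = reg.Match(item);
                if (!match.Success) continue; // skip cells without a Proxy('...') call
                // The decoded payload is plain ASCII (ip:port), so UTF-8 decodes it safely.
                var proxy = Encoding.UTF8.GetString(Convert.FromBase64String(match.Groups["value"].Value));
                Console.WriteLine($"{item}\t\tproxy = {proxy}");
            }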
    

    I get: [Image: screenshot of the console output]
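
To get back to the original goal of one list of fields per row, the same idea can be folded into the LINQ query from the question. The following is only a sketch under a few assumptions: the page keeps the structure shown above, the HtmlAgilityPack NuGet package is referenced, and the names ProxyTable, ParseRows and CellText are hypothetical helpers introduced for illustration. It takes the HTML as a string so it can be tried without a network call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class ProxyTable
{
    // Captures the Base64 argument of a Proxy('...') call.
    private static readonly Regex ProxyCall =
        new Regex(@"Proxy\('(?<value>[^']*)'\)", RegexOptions.Compiled);

    // Hypothetical helper: returns one list of cell texts per <ul> row.
    public static List<List<string>> ParseRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//div[@class='table']");
        if (table == null)
            return new List<List<string>>();

        return table.Descendants("ul")
            .Where(ul => ul.Elements("li").Count() > 1)
            .Select(ul => ul.Elements("li").Select(CellText).ToList())
            .ToList();
    }

    private static string CellText(HtmlNode li)
    {
        if (!li.HasClass("proxy"))
        {
            // Collapse the line breaks and indentation inside nested cells.
            return Regex.Replace(li.InnerText, @"\s+", " ").Trim();
        }

        // The proxy cell holds only a script, so decode its argument instead.
        Match match = ProxyCall.Match(li.InnerHtml);
        if (!match.Success)
            return string.Empty;

        try
        {
            byte[] bytes = Convert.FromBase64String(match.Groups["value"].Value);
            return Encoding.ASCII.GetString(bytes);
        }
        catch (FormatException)
        {
            // Not valid Base64; leave the field empty rather than fail the row.
            return string.Empty;
        }
    }
}
```

Calling ProxyTable.ParseRows(page) with the string downloaded in the question should give a list per row whose first item is the decoded ip:port and whose remaining items are the other cells' text. If the site changes how it obfuscates the address, the regular expression and the decoding step would need to change with it.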