
C# parsing web site with ajax loaded content


If I retrieve a web site with this function, I get the whole page, but without the AJAX-loaded values.

htmlDoc.LoadHtml(new WebClient().DownloadString(url));

Is it possible to load the web site with all values, like Google Chrome does?


Solution

  • You can use a WebBrowser control to get and render the page. Unfortunately, the control uses Internet Explorer, and you have to change a registry value to force it to use the latest engine version; even then the implementation is very brittle.
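
    Something along these lines should work as a rough, untested sketch of that approach (the hosting executable name and the 11001 value for IE11 emulation are the parts you would adapt to your own setup):

        using System;
        using System.Windows.Forms;
        using Microsoft.Win32;

        class BrowserDump
        {
            [STAThread]
            static void Main()
            {
                // Tell the WebBrowser control to emulate IE11 instead of the IE7 default;
                // the value name must match the executable that hosts the control.
                Registry.SetValue(
                    @"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                    AppDomain.CurrentDomain.FriendlyName, 11001, RegistryValueKind.DWord);

                var browser = new WebBrowser { ScriptErrorsSuppressed = true };
                browser.DocumentCompleted += (s, e) =>
                {
                    // DocumentText holds the DOM as the IE engine sees it once loading finishes;
                    // scripts that run after this point are still not captured.
                    Console.WriteLine(browser.DocumentText);
                    Application.ExitThread();
                };
                browser.Navigate("https://siderite.dev");
                Application.Run(); // pump Windows messages so the control can actually load the page
            }
        }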

    Another option is to take a standalone browser engine like WebKit and make it work in .NET. I found a page explaining how to do this, but it's pretty dated: http://webkitdotnet.sourceforge.net/basics.php

    I worked on a little demo app to get the content, and this is what I came up with:

        using System;
        using System.IO;
        using System.Text;
        using NReco.PhantomJS;

        class Program
        {
            static void Main(string[] args)
            {
                GetRenderedWebPage("https://siderite.dev", TimeSpan.FromSeconds(5), output =>
                {
                    Console.Write(output);
                    File.WriteAllText("output.txt", output);
                });
                Console.ReadKey();
            }

            private static void GetRenderedWebPage(string url, TimeSpan waitAfterPageLoad, Action<string> callBack)
            {
                // Marker line printed by the script so we know the page dump is complete.
                const string cEndLine = "All output received";

                var sb = new StringBuilder();
                var p = new PhantomJS();
                p.OutputReceived += (sender, e) =>
                {
                    // Collect everything PhantomJS writes to stdout until the marker shows up,
                    // then hand the accumulated HTML to the caller.
                    if (e.Data == cEndLine)
                    {
                        callBack(sb.ToString());
                    }
                    else
                    {
                        sb.AppendLine(e.Data);
                    }
                };
                // The script loads the page, waits a bit so AJAX calls can finish,
                // then dumps the rendered DOM followed by the marker line.
                p.RunScript(@"
    var page = require('webpage').create();
    page.viewportSize = { width: 1920, height: 1080 };
    page.onLoadFinished = function(status) {
        if (status == 'success') {
            setTimeout(function() {
                console.log(page.content);
                console.log('" + cEndLine + @"');
                phantom.exit();
            }, " + waitAfterPageLoad.TotalMilliseconds + @");
        }
    };
    var url = '" + url + @"';
    page.open(url);", new string[0]);
            }
        }
    

    This uses the PhantomJS "headless" browser by way of the NReco.PhantomJS wrapper, which you can add as a NuGet package directly from Visual Studio. I am sure it can be done better, but this is what I did today. You might want to take a look at the PhantomJS callbacks so you can properly debug what is going on; my example will wait forever if the URL fails to load, for instance. A rough sketch of how to guard against that follows below. Here is a useful link: https://newspaint.wordpress.com/2013/04/25/getting-to-the-bottom-of-why-a-phantomjs-page-load-fails/
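
    For example, the script passed to RunScript could be extended with PhantomJS's page.onResourceError callback, the 'fail' branch of onLoadFinished, and a hard timeout, so the C# callback still fires on failure. The sketch below is only illustrative: the BuildScript helper is a made-up name, and it assumes the same cEndLine marker and wrapper as the demo above; onResourceError and the 'fail' status are standard PhantomJS APIs.

        // Illustrative variant of the script above: report load failures and
        // bail out after a hard timeout instead of waiting forever.
        private static string BuildScript(string url, double waitMs, string endLine)
        {
            return @"
    var page = require('webpage').create();
    page.onResourceError = function(err) {
        // Log resource failures (DNS errors, 404s, blocked requests, ...).
        console.log('Resource error: ' + err.url + ' - ' + err.errorString);
    };
    page.onLoadFinished = function(status) {
        if (status === 'success') {
            setTimeout(function() {
                console.log(page.content);
                console.log('" + endLine + @"');
                phantom.exit();
            }, " + waitMs + @");
        } else {
            // The page did not load at all; emit the marker so the C# callback still fires.
            console.log('" + endLine + @"');
            phantom.exit(1);
        }
    };
    // Safety net: give up entirely after 30 seconds.
    setTimeout(function() {
        console.log('" + endLine + @"');
        phantom.exit(1);
    }, 30000);
    page.open('" + url + @"');";
        }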