I'm writing a C# program that scrapes a large number of company websites (up to 100,000) for up-to-date contact information and some information about each company's field of operations. Because most of the websites can't be displayed in the regular .NET WebBrowser control, I'm using GeckoFX to navigate to them, and I select the nodes relevant to me with HtmlAgilityPack.
The process is always the same: if I have a URL for a company, I visit the website right away; otherwise I use Bing to look for a web address (Google seems to dislike being used automatically). On the website I look for a link to an imprint and for links to pages that could indicate some area of activity, navigate to these links, and look for catchphrases that I specified beforehand. Everything runs synchronously: I wait for the browser to trigger its DocumentCompleted event every time.
An example:
// Navigate to Bing, searching for the company's name and postal code.
Variables.browser.Navigate("https://www.bing.com/search?q=" + c.Name.Replace(" ", "+") + "+" + c.Zip.Replace(" ", "+"));
// Wait for the browser to finish loading. The Navigating event sets BrowserIsReady to false and the DocumentCompleted event sets it to true.
do
{
    f.Application.DoEvents();
} while (!Variables.BrowserIsReady);
HtmlDocument browserDoc = new HtmlDocument();
browserDoc.LoadHtml(Variables.browser.Document.Body.OuterHtml);
// Select the relevant node in the document.
HtmlNode sidebarNode = browserDoc.DocumentNode.SelectSingleNode("//div[contains(concat(\" \", normalize-space(@class), \" \"), \" b_entityTP \")]");
if (sidebarNode != null)
{
    Variables.logger.Log("Found readable sidebar. Loading data...");
    string lookedUpName, lookedUpStreet, lookedUpCity, lookedUpZip, lookedUpPhone, lookedUpWebsite;
    // A leading ".//" keeps the search inside the current node; a bare "//" would search the whole document again.
    HtmlNode infoNode = sidebarNode.SelectSingleNode(".//div[contains(concat(\" \", normalize-space(@class), \" \"), \" b_subModule \")]");
    // infoNode may be null, so guard the nested lookup.
    HtmlNode nameNode = infoNode?.SelectSingleNode(".//div[contains(concat(\" \", normalize-space(@class), \" \"), \" b_feedbackComponent \")]");
    if (nameNode != null)
    {
        string[] dataFacts = nameNode.GetAttributeValue("data-facts", "").Replace("{\"", "").Replace("\"}", "").Split(new string[] { "\",\"" }, StringSplitOptions.None);
        foreach (string dataFact in dataFacts)
        {
            //... abbreviated
        }
    }
    // At the end of every call to a node object I set it back to null.
    nameNode = null;
}
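The search URL in the example is built by replacing spaces only, so a company name containing characters such as & or # would alter the query string. A minimal sketch of an escaped version follows; the SearchUrl class and its Build method are hypothetical names introduced here for illustration, not part of the original code.

```csharp
using System;

static class SearchUrl
{
    // Hypothetical helper (not in the original code): builds the Bing search URL
    // from a company name and postal code.
    public static string Build(string name, string zip)
    {
        // Uri.EscapeDataString percent-encodes spaces and reserved characters
        // such as '&', '#' and '+', so they cannot break the query string.
        string query = Uri.EscapeDataString(name + " " + zip);
        return "https://www.bing.com/search?q=" + query;
    }
}

static class Program
{
    static void Main()
    {
        // Prints https://www.bing.com/search?q=Smith%20%26%20Co%2012345
        Console.WriteLine(SearchUrl.Build("Smith & Co", "12345"));
    }
}
```

Spaces come out as %20 rather than +, which search engines generally treat the same way in a query parameter.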
My GeckoFX instance is not allowed to write cache to memory or to load images from websites, which I set by using
GeckoPreferences.Default["browser.cache.memory.enabled"] = false;
GeckoPreferences.Default["permissions.default.image"] = 2;
before creating my GeckoWebBrowser instance.
After every scraped website I call
//CookieMan is used as a global variable so I don't have to recreate it every time.
private static nsICookieManager CookieMan;
//...
CookieMan = Xpcom.GetService<nsICookieManager>("@mozilla.org/cookiemanager;1");
CookieMan = Xpcom.QueryInterface<nsICookieManager>(CookieMan);
CookieMan.RemoveAll();
Gecko.Cache.ImageCache.ClearCache(true);
Gecko.Cache.ImageCache.ClearCache(false);
Xpcom.GetService<nsIMemory>("@mozilla.org/xpcom/memory-service;1").HeapMinimize(true);
to delete cookies, clear the image cache (which I'm not sure is even created) and minimize XULRunner's memory usage.
Nevertheless, after starting quite nicely with an approximate runtime of 2-3 seconds per record and a comfortable 200-300 MB of memory usage, both quickly blow up to 16-17 seconds per record and over 2 GB of used memory for my crawler alone after 1 hour.
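To narrow down where the growth comes from, it may help to log the managed heap size next to the process's private bytes after each record. If the managed figure stays flat while private bytes climb, the memory is probably held by native code, which GC.Collect() cannot free. This is only a diagnostic sketch; MemoryProbe and Snapshot are hypothetical names.

```csharp
using System;
using System.Diagnostics;

static class MemoryProbe
{
    // Hypothetical helper: reports the managed heap size and the private bytes
    // of the current process, both in MB.
    public static string Snapshot()
    {
        // Managed heap only; 'false' means no collection is forced first.
        long managedBytes = GC.GetTotalMemory(false);
        using (Process current = Process.GetCurrentProcess())
        {
            // Private bytes cover managed and native allocations together.
            long privateBytes = current.PrivateMemorySize64;
            return string.Format(
                "managed={0:F1} MB, private={1:F1} MB",
                managedBytes / 1048576.0,
                privateBytes / 1048576.0);
        }
    }
}

static class Program
{
    static void Main()
    {
        // In the crawler this would be logged once per scraped record.
        Console.WriteLine(MemoryProbe.Snapshot());
    }
}
```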
I tried forcing garbage collection with GC.Collect(); (which I know you're not supposed to do) and even recycling the entire browser object by stopping, disposing and recreating it to get rid of unused junk in memory, but to no avail. I also tried shutting down XULRunner and starting it again, but Xpcom.Shutdown() seems to stop the entire app, so I wasn't able to do that.
I'm pretty much out of ideas at this point and would very much appreciate new hints to approaches I haven't yet taken.
Have you tried using recycled AppDomains?
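// Requires "using System;" and "using System.Net;". AppDomain.CreateDomain is only supported on .NET Framework.
AppDomain workerAppDomain = AppDomain.CreateDomain("WorkerAppDomain");
workerAppDomain.SetData("URL", "https://stackoverflow.com");
// Input and output are exchanged through SetData/GetData rather than captured variables.
workerAppDomain.DoCallBack(() =>
{
    var url = (string)AppDomain.CurrentDomain.GetData("URL");
    Console.WriteLine($"Scraping {url}");
    using (var webClient = new WebClient())
    {
        var content = webClient.DownloadString(url);
        AppDomain.CurrentDomain.SetData("OUTPUT", content.Length);
    }
});
int contentLength = (int)workerAppDomain.GetData("OUTPUT");
// Unloading the domain releases the managed objects created inside it.
AppDomain.Unload(workerAppDomain);
Console.WriteLine($"ContentLength: {contentLength:#,0}");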
Output:
Scraping https://stackoverflow.com
ContentLength: 262.013
The data you pass between the main AppDomain and the worker AppDomain must be serializable.
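As an illustration of the serialization requirement, a custom result type can be marked [Serializable] so that a copy of it crosses the AppDomain boundary. This is a sketch for .NET Framework only; ScrapeResult and Worker are hypothetical names, and the "scraping" is a placeholder.

```csharp
using System;

// Hypothetical result type. [Serializable] lets a copy cross the AppDomain boundary.
[Serializable]
public class ScrapeResult
{
    public string Url;
    public int ContentLength;
}

public static class Worker
{
    // Runs inside the worker AppDomain. Being static, it needs no delegate target
    // to be marshaled across the boundary.
    public static void Run()
    {
        var url = (string)AppDomain.CurrentDomain.GetData("URL");
        // Placeholder for the real scraping work.
        var result = new ScrapeResult { Url = url, ContentLength = url.Length };
        AppDomain.CurrentDomain.SetData("OUTPUT", result);
    }
}

public static class Program
{
    public static void Main()
    {
        AppDomain worker = AppDomain.CreateDomain("WorkerAppDomain");
        try
        {
            worker.SetData("URL", "https://stackoverflow.com");
            worker.DoCallBack(Worker.Run);
            // GetData returns a deserialized copy of the object, not a shared reference.
            var result = (ScrapeResult)worker.GetData("OUTPUT");
            Console.WriteLine(result.Url + ": " + result.ContentLength);
        }
        finally
        {
            // Unload even if the worker throws.
            AppDomain.Unload(worker);
        }
    }
}
```

Without the [Serializable] attribute (or a MarshalByRefObject base class), the GetData call in the main domain would be expected to fail with a SerializationException.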
Update: The cleanest solution is probably to use separate processes, though. When a worker process exits, the operating system reclaims all of its memory, managed and native alike, so the leakage can be cleaned up reliably.
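A rough sketch of the separate-process idea, assuming the same executable is started again with a hypothetical --worker argument and returns its result on standard output. WorkerHost, BuildArguments and RunWorker are names made up for this example, and the worker body is a placeholder for the real scraping.

```csharp
using System;
using System.Diagnostics;

public static class WorkerHost
{
    // Hypothetical helper: command line for the child process.
    public static string BuildArguments(string url)
    {
        return "--worker \"" + url + "\"";
    }

    // Starts this same executable as a worker and returns what it wrote to stdout.
    public static string RunWorker(string url)
    {
        var startInfo = new ProcessStartInfo
        {
            // Assumption: the running program is a plain .exe that can start itself.
            FileName = Process.GetCurrentProcess().MainModule.FileName,
            Arguments = BuildArguments(url),
            UseShellExecute = false,          // required for output redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
        using (Process worker = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full output pipe cannot block the child.
            string output = worker.StandardOutput.ReadToEnd();
            worker.WaitForExit();
            return output.Trim();
        }
    }
}

public static class Program
{
    public static int Main(string[] args)
    {
        if (args.Length == 2 && args[0] == "--worker")
        {
            // Worker mode: placeholder for scraping args[1] with the browser.
            Console.WriteLine(args[1].Length);
            return 0;
        }
        // Parent mode: all memory used by the worker is released when it exits.
        Console.WriteLine("Worker returned: " + WorkerHost.RunWorker("https://stackoverflow.com"));
        return 0;
    }
}
```

Starting one process per record would add noticeable startup cost, so a worker would more likely handle a batch of records and exit once the batch is done or its memory passes a chosen limit.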