Hi, I am writing a crawler for a site. After about 3 hours of crawling, my app stopped with a WebException. Below is my C# code. client is a predefined WebClient object that is disposed each time gameDoc has been processed. gameDoc is an HtmlDocument object (from HtmlAgilityPack).
    while (retrygamedoc)
    {
        try
        {
            gameDoc.LoadHtml(client.DownloadString(url)); // this line caused the exception
            retrygamedoc = false;
        }
        catch
        {
            client.Dispose();
            client = new WebClient();
            retrygamedoc = true;
            Thread.Sleep(500);
        }
    }
I then tried the code below (to keep the WebClient fresh), taken from this answer:
    while (retrygamedoc)
    {
        try
        {
            using (WebClient client2 = new WebClient())
            {
                gameDoc.LoadHtml(client2.DownloadString(url)); // this line causes the exception
                retrygamedoc = false;
            }
        }
        catch
        {
            retrygamedoc = true;
            Thread.Sleep(500);
        }
    }
But the result was still the same. Then I used HttpWebRequest with a StreamReader, and the result stayed the same! Below is my code using StreamReader.
    while (retrygamedoc)
    {
        try
        {
            // using the native request classes to check the result
            HttpWebRequest webreq = (HttpWebRequest)WebRequest.Create(url);
            string responsestring = string.Empty;
            HttpWebResponse response = (HttpWebResponse)webreq.GetResponse(); // this causes the exception
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                responsestring = reader.ReadToEnd();
            }
            gameDoc.LoadHtml(responsestring); // parse the text that was just read
            retrygamedoc = false;
        }
        catch
        {
            retrygamedoc = true;
            Thread.Sleep(500);
        }
    }
What should I do and check? I am confused because I am able to crawl some pages on the same site, but after about 1000 results it throws the exception. The only message from the exception is "The request was aborted: The connection was closed unexpectedly." and the status is ConnectionClosed.
PS. The app is a desktop form app.
Update:
For now I am skipping the failed values and setting them to null so that the crawl can go on. But if the data is really needed, I still have to update the crawl results manually, which is tiring because the results contain thousands of records. Please help me.
Example:
It is as if you have downloaded about 1300 records from the website, and then the application stops, saying "The request was aborted: The connection was closed unexpectedly." while your internet connection is still up and running at a good speed.
ConnectionClosed may indicate (and probably does) that the server you're downloading from is closing the connection. Perhaps it is noticing a large number of requests from your client and is denying you additional service.
Since you can't control server-side shenanigans, I'd recommend you have some sort of logic to retry the download a bit later.
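One way to "retry a bit later" is to wait longer after each failure instead of a fixed 500 ms. The following is only a minimal sketch under some assumptions: RetryingDownloader, DownloadWithRetry, maxAttempts and initialDelayMs are hypothetical names made up for illustration, the default values are arbitrary, and whether backing off actually helps depends on how the server limits clients.

```csharp
using System;
using System.Net;
using System.Threading;

public static class RetryingDownloader
{
    // Hypothetical helper (not part of WebClient or HtmlAgilityPack).
    // Calls `download`; on a WebException it waits and tries again,
    // doubling the wait after each failure, for at most `maxAttempts` calls.
    public static string DownloadWithRetry(
        Func<string> download, int maxAttempts = 5, int initialDelayMs = 2000)
    {
        int delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return download();
            }
            catch (WebException ex)
            {
                // Out of attempts: rethrow so the caller can log or skip the URL.
                if (attempt >= maxAttempts)
                {
                    throw;
                }
                Console.WriteLine("Attempt {0} failed ({1}); retrying in {2} ms",
                    attempt, ex.Status, delayMs);
                Thread.Sleep(delayMs);
                delayMs *= 2; // exponential backoff
            }
        }
    }
}
```

In the question's loop, the call might then look like `gameDoc.LoadHtml(RetryingDownloader.DownloadWithRetry(() => client.DownloadString(url)));`. Catching only WebException, rather than using a bare catch, means unrelated bugs are not retried silently, and logging ex.Status shows whether the failures are always ConnectionClosed.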