I am writing a web scraper that grabs specific URLs and adds them to a list.
using HtmlAgilityPack;

List<string> mylist = new List<string>();
var firstUrl = "http://example.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(firstUrl);
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
{
    if (!mylist.Contains(htmlNode.InnerText))
    {
        mylist.Add(htmlNode.InnerText);
    }
}
What I want to do at this point is loop through 'mylist', do the exact same thing for each item, and basically continue forever. The code should take newly parsed URLs and add them to the list. What would be the easiest way to do this?
I tried adding a for loop right after the one above, but it does not seem to update the list. It only keeps looping over the items already in the list (since i will always be less than mylist.Count).
for (int i = 0; i < mylist.Count; i++)
{
    //the items in mylist are added to the url
    var urls = "http://example.com" + mylist[i];
    HtmlWeb web = new HtmlWeb();
    HtmlDocument document = web.Load(urls);
    HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
    foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
    {
        if (!mylist.Contains(htmlNode.InnerText))
        {
            mylist.Add(htmlNode.InnerText);
        }
    }
}
Thanks!
A Queue fits your requirement:
Queue<string> mylist = new Queue<string>();
First pass:
using System.Collections.Generic;
using HtmlAgilityPack;

Queue<string> mylist = new Queue<string>();
var firstUrl = "http://example.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(firstUrl);
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
// SelectNodes returns null when nothing matches, so guard before iterating.
if (nodes != null)
{
    foreach (HtmlNode htmlNode in nodes)
    {
        // Skip links that are already waiting in the queue.
        if (!mylist.Contains(htmlNode.InnerText))
        {
            mylist.Enqueue(htmlNode.InnerText);
        }
    }
}
Now the second pass:
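// Remember every link ever queued so that no page is fetched twice;
// Queue.Contains only sees items that have not been dequeued yet.
var seen = new HashSet<string>(mylist);
while (mylist.Count > 0)
{
    var url = mylist.Dequeue();
    //the items in mylist are added to the url
    var urls = "http://example.com" + url;
    // Reuse web, document and nodes from the first pass;
    // redeclaring them inside the loop would not compile.
    document = web.Load(urls);
    nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
    // SelectNodes returns null when nothing matches.
    if (nodes == null)
    {
        continue;
    }
    foreach (HtmlNode htmlNode in nodes)
    {
        // Add returns false when the link was already seen.
        if (seen.Add(htmlNode.InnerText))
        {
            mylist.Enqueue(htmlNode.InnerText);
        }
    }
}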
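If it helps to see the queue idea on its own, here is a minimal sketch with the network part taken out. It assumes you want each link visited exactly once, so it pairs the queue with a HashSet of everything seen so far. The names Crawl, getLinks and the small site dictionary are made up for illustration; in a real scraper getLinks would be your web.Load plus SelectNodes code.

```csharp
using System;
using System.Collections.Generic;

// Breadth-first crawl: visits every reachable link once, in discovery order.
// getLinks is a hypothetical stand-in for "load the page and select its links".
static List<string> Crawl(string start, Func<string, IEnumerable<string>> getLinks)
{
    var seen = new HashSet<string> { start };
    var pending = new Queue<string>();
    pending.Enqueue(start);
    var visited = new List<string>();

    while (pending.Count > 0)
    {
        var current = pending.Dequeue();
        visited.Add(current);
        foreach (var link in getLinks(current))
        {
            // HashSet.Add returns false for a link that was already seen,
            // so pages that link back to each other are not queued again.
            if (seen.Add(link))
            {
                pending.Enqueue(link);
            }
        }
    }
    return visited;
}

// Made-up link graph standing in for a real site; note the cycle /b -> /a.
var site = new Dictionary<string, string[]>
{
    ["/"] = new[] { "/a", "/b" },
    ["/a"] = new[] { "/b", "/c" },
    ["/b"] = new[] { "/a" },
    ["/c"] = new string[0],
};

var order = Crawl("/", page => site[page]);
Console.WriteLine(string.Join(" ", order)); // prints "/ /a /b /c"
```

The loop ends once no unseen links remain. On a site that keeps producing new links it would keep running, so you may want a page limit as well.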