I'm trying to gather a list of hyperlinks (the URLs they link to) using WatiN. I tried using:
foreach (Link l in myIE.Links)
{
    Links.Add(l.ToString());
}
string LinksCSV = string.Join(",", Links.ToArray());
richTextBox2.Text = LinksCSV;
I am trying to list all hyperlinks in my RichTextBox; however, the code above returned the hyperlink name, so it showed "Link" over and over again.
Additionally, I need to list only the URLs that contain "webpage.php?id=" followed by a unique number. How do I filter the scraped URLs down to only the ones that contain "webpage.php?id="?
UPDATE: Here is an updated test that works on other sites, but not on my required site. The code below works.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using WatiN.Core;
namespace ScrapeTest
{
    class Program
    {
        [STAThread]
        static void Main(string[] args)
        {
            IE ie = new IE();
            ie.GoTo("http://www.freesound.org/browse/tags/organ/");
            foreach (var currLink in ie.Links)
            {
                if (currLink.Url.Contains("sounds"))
                {
                    Console.WriteLine("contains Edit in the link Url" + currLink.Url);
                }
            }
            Console.ReadLine();
        }
    }
}
The code seems to be correct; however, its interaction with my specific URL and hyperlinks seems to be the issue. The site and hyperlinks I'm after contain sensitive information, hence their omission.
Using my site's main page (http://website.com) the script runs, so the issue is with the specific page (http://website.com/data.php?search=%22%22&cat=0) I send it to. Could it be because of the .php in the URL? Also, the URLs are stored on the page as shown below, if it helps.
<td class="alt2">
    <a align="center" href="data.php?id=111111">EDIT</a>
</td>
UPDATE and SOLUTION: For some reason the issue seems to occur when I try to use the Url.Contains method. What I have ended up doing is storing every single scraped Url into a list and will test my list line by line as needed to return the required Urls. Thank you so much for your help.
In your code, myIE.Links is a LinkCollection, meaning that when you iterate through the Link objects you need to specify which property you want; in this case it will be Url.
Example - Go to google.com and write out link addresses to the console.
ie.GoTo("http://www.google.com");
System.Threading.Thread.Sleep(5000); // Added to diagnose what might be a timing issue.
foreach (var currLink in ie.Links)
{
    if (currLink.Url.Contains("www.google.com"))
    {
        Console.WriteLine("contains www.google.com in the link Url" + currLink.Url);
    }
}
Tested on WatiN 2.1, IE9, Win7.
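For the filtering part of the question, here is a minimal sketch of the "store everything in a list, then test the list" approach from the update. It works on plain strings so it can be run without WatiN or a browser. The LinkFilter class, the FilterUrls method and the sample URLs are hypothetical names and data made up for illustration. The null check is a guess on my part: if Url comes back null for some anchors on your page (for example ones without an href), calling Contains on it directly would throw, which might explain the failure you saw. I have not verified that against your page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LinkFilter
{
    // Hypothetical helper: keeps only the URLs that contain the marker,
    // skipping null or empty entries so Contains is never called on null.
    public static List<string> FilterUrls(IEnumerable<string> urls, string marker)
    {
        return urls
            .Where(u => !string.IsNullOrEmpty(u) && u.Contains(marker))
            .ToList();
    }
}

class Program
{
    static void Main()
    {
        // Made-up sample data standing in for the URLs scraped from the page.
        var scraped = new List<string>
        {
            "http://website.com/webpage.php?id=111111",
            null,
            "http://website.com/index.php",
            "http://website.com/webpage.php?id=222222"
        };

        List<string> matches = LinkFilter.FilterUrls(scraped, "webpage.php?id=");

        // Same CSV join as in the question.
        Console.WriteLine(string.Join(",", matches.ToArray()));
    }
}
```

With WatiN, the input list would presumably be built inside the existing foreach loop by adding currLink.Url for each link, and the joined result assigned to richTextBox2.Text instead of being written to the console.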