c#, windows-phone-7, web

Web scraping in C# on Windows Phone


Hi, I need to get all the data from a page: in this case, the photo and the name of each topic. The page is here.

I know of two alternatives. With the first one, I can only get a single image from the entire page. If anyone knows how to extend it to catch everything, that would be the best way:

// Skip ahead to the first <img> tag in the downloaded HTML.
int startIndex = e.Result.IndexOf(@"><img");
string result = e.Result.Substring(startIndex);
// The image URL sits between ".php?src=" and ".jpg".
startIndex = result.IndexOf(".php?src=") + 9;
int endIndex = result.IndexOf(".jpg", startIndex);
string link = result.Substring(startIndex, endIndex - startIndex) + ".jpg";
MessageBox.Show(link);
imagem.Source = new BitmapImage(new Uri(link));
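
To catch every image instead of only the first, the same IndexOf search could be repeated in a loop that resumes after each match. This is only a minimal sketch: it assumes every image URL sits between ".php?src=" and ".jpg" as in the snippet above, and the names ImageLinkScanner and FindAllImageLinks are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class ImageLinkScanner
{
    // Collects every substring that follows ".php?src=" and ends with ".jpg".
    // Assumption: each image URL on the page is delimited by these two markers.
    public static List<string> FindAllImageLinks(string html)
    {
        const string startMarker = ".php?src=";
        const string endMarker = ".jpg";
        var links = new List<string>();
        int position = 0;

        while (true)
        {
            // Look for the next start marker after the previous match.
            int markerIndex = html.IndexOf(startMarker, position, StringComparison.Ordinal);
            if (markerIndex < 0) break;

            int startIndex = markerIndex + startMarker.Length;
            int endIndex = html.IndexOf(endMarker, startIndex, StringComparison.Ordinal);
            if (endIndex < 0) break;

            // Re-append the extension, as the single-image snippet does.
            links.Add(html.Substring(startIndex, endIndex - startIndex) + endMarker);
            position = endIndex + endMarker.Length;
        }

        return links;
    }
}
```

Calling ImageLinkScanner.FindAllImageLinks(e.Result) in the download handler would then give a list to iterate over, rather than a single link.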

The other way is this. I created a class to hold the data and build a list, but the "pattern" string must be totally wrong, because I am not used to writing strings of this type. I just copied one from another topic and tried to create my own based on it:

private void ConsultaPopularVideos(string uri)
        {
            WebClient web2 = new WebClient();
            // Subscribe before starting the download so the completion event cannot be missed.
            web2.DownloadStringCompleted += web2_DownloadStringCompleted;
            web2.DownloadStringAsync(new Uri(uri));
        }

        void web2_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
        {
            if (!e.Cancelled && e.Error == null && !String.IsNullOrEmpty(e.Result))
            {
                _popVideos = new List<PopularVideos>();
                // Here you grab all the links on the page
                // P.S.: If the page changes, you have to update the pattern here.
                string pattern = @"\<a\shref\=[\""|\'](?<url>[^\""|\']+)[\""|\']\stitle\=[\""|\'](?<title>[^\""|\']+).php?src=[\""|\'](?<img>[^\""|\']+)[\""|\']\s\width='275'";


                // Search the HTML for all links
                MatchCollection ms = Regex.Matches(e.Result, pattern, RegexOptions.Multiline);


                Debug.WriteLine("----- OK {0} links found", ms.Count);

                foreach (Match m in ms)
                {
                    // The pattern above says where the URL and the artist's name are located,
                    // and they are retrieved here
                    Group url = m.Groups["url"];
                    MessageBox.Show(m.Groups.ToString());
                    Group title = m.Groups["title"];
                    Group img = m.Groups["img"];

                    if (url != null && title != null && img != null)
                    {
                        //Debug.WriteLine("author: {0}\nUrl: {1}", author.Value, url.Value);

                        // If the artist's link was found (there are other links on the page), continue
                        if (url.Value.ToLower().IndexOf("/") > -1)
                        {
                            // Add an Artist object to the list
                            PopularVideos video = new PopularVideos(title.Value, url.Value, img.Value);
                            _popVideos.Add(video);                            
                        }
                    }
                }
                listBoxPopular.ItemsSource = _popVideos;
            }
        }

Class:

class PopularVideos
    {
        public PopularVideos() { }
        public PopularVideos(string nome, string url, string img)
        {
            Nome = nome;
            Url = new Uri(url);
            Img = img; // Store the image URL so the Img property is populated
        }
        public string Nome { get; set; }
        public string Img { get; set; }
        public Uri Url { get; set; }
    }

Solution

  • Using regex to scrape data from a web page is not a good solution, as it will be unreliable, fragile and difficult to implement. I recommend using HtmlAgilityPack (http://htmlagilitypack.codeplex.com/) to scrape the data. It is a mature library that supports Windows Phone; I used it in my own Windows Phone app and am very happy with it.
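
A rough sketch of what that could look like with HtmlAgilityPack follows. The page markup is not shown in the question, so this assumes (based only on the regex above) that each entry is an <a> element with href and title attributes wrapping an <img> whose src contains "php?src=" followed by the real image URL. ScrapedVideo and VideoScraper are hypothetical names used for illustration. The sketch uses the LINQ-style Descendants() method rather than XPath.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical plain data holder for this sketch.
public class ScrapedVideo
{
    public string Title { get; set; }
    public string Url { get; set; }
    public string ImageUrl { get; set; }
}

public static class VideoScraper
{
    // Parses downloaded HTML and returns one entry per <a> that wraps an <img>.
    public static List<ScrapedVideo> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var videos = new List<ScrapedVideo>();
        foreach (HtmlNode link in doc.DocumentNode.Descendants("a"))
        {
            string title = link.GetAttributeValue("title", "");
            string href = link.GetAttributeValue("href", "");
            HtmlNode img = link.Descendants("img").FirstOrDefault();

            // Skip links that are not video entries (no title, no href, or no image).
            if (title.Length == 0 || href.Length == 0 || img == null) continue;

            string src = img.GetAttributeValue("src", "");
            // Assumption: the real image URL follows "php?src=" in the thumbnail link.
            const string marker = "php?src=";
            int markerIndex = src.IndexOf(marker, StringComparison.Ordinal);
            string imageUrl = markerIndex >= 0 ? src.Substring(markerIndex + marker.Length) : src;

            videos.Add(new ScrapedVideo { Title = title, Url = href, ImageUrl = imageUrl });
        }

        return videos;
    }
}
```

In the DownloadStringCompleted handler from the question, the result could then be bound with something like listBoxPopular.ItemsSource = VideoScraper.Parse(e.Result); the attribute names and the "php?src=" marker would need adjusting to the real page.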