c# html-agility-pack meta-search search-engine-api

Get links from search engines in C#


First of all, excuse my broken English.
I want to write a metasearch engine. First I tried the Google, Bing, and Yahoo APIs, but they were limited.
So now I'm trying to use HtmlAgilityPack to get the result links from the search engines.
I have this code:

using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Xml;

namespace Search
{
public partial class Form1 : Form
{
    // load snippet
    HtmlAgilityPack.HtmlDocument htmlSnippet = new HtmlAgilityPack.HtmlDocument();

    public Form1()
    {
        InitializeComponent();
    }

    private void btn1_Click(object sender, EventArgs e)
    {
        listBox1.Items.Clear();
        // Build the query URL; escape the keywords so spaces and symbols survive.
        string SearchResults = "http://google.com/search?q=" + Uri.EscapeDataString(txtKeyWords.Text.Trim());
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(SearchResults);

        // Read the whole response body as text and dispose the response and reader when done.
        string sbb;
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            sbb = reader.ReadToEnd();
        }

        HtmlAgilityPack.HtmlDocument html = new HtmlAgilityPack.HtmlDocument();
        html.OptionOutputAsXml = true;
        html.LoadHtml(sbb);
        HtmlNode doc = html.DocumentNode;

        // SelectNodes returns null when nothing matches, so guard before iterating.
        HtmlNodeCollection links = doc.SelectNodes("//a[@href]");
        if (links == null)
        {
            return;
        }

        foreach (HtmlNode link in links)
        {
            string hrefValue = link.GetAttributeValue("href", string.Empty);
            // Keep only redirect links (/url?q=...) that point to an external http:// address.
            if (!hrefValue.ToUpper().Contains("GOOGLE") && hrefValue.Contains("/url?q=") && hrefValue.ToUpper().Contains("HTTP://"))
            {
                // Cut off the extra parameters that follow the target URL.
                int index = hrefValue.IndexOf("&");
                if (index > 0)
                {
                    hrefValue = hrefValue.Substring(0, index);
                    listBox1.Items.Add(hrefValue.Replace("/url?q=", ""));
                }
            }
        }
    }
}

}

Can I use this code for all search engines? I changed these lines to make it work with other search engines:

if (!hrefValue.ToString().ToUpper().Contains("YAHOO") && hrefValue.ToString().Contains("/url?q=") && hrefValue.ToString().ToUpper().Contains("HTTP://"))

and

string SearchResults = "http://yahoo.com/search?q=" + textBox1.Text.Trim();

but it doesn't work.

My other problem is that this code only returns the links from the first page. What should I do if I want the first N links?
Can anybody help?


Solution

  • First of all, this post contains more than one question. Please open a separate topic for each question.

    In the case of Yahoo, "http://yahoo.com/search?q=" is not a valid search URL: if you try http://yahoo.com/search?q=stackoverflow, you don't get the results page. You have to find the search URL for each search engine. For example, Yahoo uses https://search.yahoo.com/search?p=.
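As a rough sketch of keeping one search URL per engine, the snippet below stores the two URLs discussed in this thread in a dictionary and appends the escaped keywords. The `SearchUrls` class and `BuildUrl` method are made-up names for illustration, and the `www.google.com` host is an assumption (the question uses `google.com`).

```csharp
using System;
using System.Collections.Generic;

public static class SearchUrls
{
    // Base search URL per engine, ending in that engine's query parameter.
    // The Google host is an assumption; the Yahoo URL is the one given above.
    private static readonly Dictionary<string, string> Templates =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "google", "https://www.google.com/search?q=" },
            { "yahoo",  "https://search.yahoo.com/search?p=" },
        };

    // Returns the first-page search URL for the given engine and keywords.
    public static string BuildUrl(string engine, string keywords)
    {
        string baseUrl;
        if (!Templates.TryGetValue(engine, out baseUrl))
        {
            throw new ArgumentException("Unknown search engine: " + engine);
        }
        // Escape the keywords so spaces and symbols do not break the URL.
        return baseUrl + Uri.EscapeDataString(keywords.Trim());
    }
}
```

Calling `SearchUrls.BuildUrl("yahoo", txtKeyWords.Text)` would then replace the hard-coded string concatenation in the click handler.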

    You also have to adapt this condition for each search engine: if (!hrefValue.ToString().ToUpper().Contains("YAHOO") && hrefValue.ToString().Contains("/url?q=") && hrefValue.ToString().ToUpper().Contains("HTTP://")). For example, as written it only keeps HTTP links; HTTPS links are discarded.
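One possible way to write that check so it accepts both HTTP and HTTPS targets is sketched below for hrefs of the form `/url?q=<target>&...`, the shape the question's code expects from Google. `ResultLinks` and `ExtractTarget` are hypothetical names, the sketch uses a prefix test instead of the original `Contains` call, and it leaves out the host-name exclusion; other engines would need their own pattern.

```csharp
using System;

public static class ResultLinks
{
    // Returns the target URL from a "/url?q=<target>&..." href,
    // or null when the href is not a redirect to an http or https address.
    public static string ExtractTarget(string href)
    {
        const string prefix = "/url?q=";
        if (href == null || !href.StartsWith(prefix, StringComparison.Ordinal))
        {
            return null;
        }

        string target = href.Substring(prefix.Length);

        // Drop the extra parameters that follow the target URL.
        int amp = target.IndexOf('&');
        if (amp >= 0)
        {
            target = target.Substring(0, amp);
        }

        // The target is percent-encoded inside the query string.
        target = Uri.UnescapeDataString(target);

        bool isWeb = target.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
                  || target.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
        return isWeb ? target : null;
    }
}
```

Inside the `foreach` loop, a non-null return value from `ResultLinks.ExtractTarget(hrefValue)` could be added to the list box directly.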

    Pagination

    Google uses &start= for pagination and usually returns 10 results per page. So with start=20 you get results 21 to 30: https://www.google.es/search?q=stackoverflow&start=20

    Yahoo also returns 10 results per page and uses &b= for pagination. b=1 is the first page, b=11 the second, and so on. Example: https://search.yahoo.com/search?p=stackoverflow&b=11
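To get the first N links, the two pagination schemes can be turned into a list of page URLs to fetch one after another. This is only a sketch: `Pagination`, `GooglePages`, and `YahooPages` are invented names, the `www.google.com` host is an assumption, and the page size of 10 is the usual value mentioned above, not a guarantee.

```csharp
using System;
using System.Collections.Generic;

public static class Pagination
{
    // Assumed results per page for both engines.
    private const int PageSize = 10;

    // Google: "start" is the number of results to skip (0, 10, 20, ...).
    public static List<string> GooglePages(string query, int wanted)
    {
        var urls = new List<string>();
        for (int start = 0; start < wanted; start += PageSize)
        {
            urls.Add("https://www.google.com/search?q="
                     + Uri.EscapeDataString(query) + "&start=" + start);
        }
        return urls;
    }

    // Yahoo: "b" is the 1-based index of the first result (1, 11, 21, ...).
    public static List<string> YahooPages(string query, int wanted)
    {
        var urls = new List<string>();
        for (int first = 1; first <= wanted; first += PageSize)
        {
            urls.Add("https://search.yahoo.com/search?p="
                     + Uri.EscapeDataString(query) + "&b=" + first);
        }
        return urls;
    }
}
```

Each URL would be downloaded and parsed the same way as the first page, stopping once N links have been collected, since a page may yield fewer than 10 usable links.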

    I hope this can help you.