
How should I modify the selector to parse Google News search article titles, previews, and URLs?


I want to parse Google News search results: 1) article title 2) preview 3) URL

To do this, I need to change the CSS selector so that it matches the page structure.

Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");

Specifically, this part:

( ".g>.r>.a")

How should I modify it?


Full code :

  public static void main(String[] args) throws UnsupportedEncodingException, IOException {

    String google = "http://www.google.com/search?q=";

    String search = "stackoverflow";

    String charset = "UTF-8";

    String news="&tbm=nws";


    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!

    Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");

    for (Element link : links) {
        String title = link.text();
        String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");

        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }
        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }
}



Solution

  • How to select the right elements (using Chrome)

    First step: disable JavaScript in your browser (for example with an add-on like uMatrix, for convenience), so that you see the same page as jsoup does.

    Now right-click an element and choose Inspect, or open the dev tools with Ctrl+Shift+I. When you hover over the source code in the Elements tab, the related element is highlighted in the rendered page. Right-clicking an element in the source offers Copy -> Copy selector. That is a good starting point, but sometimes too strict. Here it gives the selector #rso > div:nth-child(3), i.e. the third direct child div of the element with id rso. That is too specific, so we generalize it:

    We select all direct child divs of the element with id rso: #rso > div.

    Then we grab the headline anchor with h3 > a; its text node gives the title and its href attribute gives the URL.

    Next we grab the inner div with class st (div.st), which contains the preview in its text node. If that div is missing, we skip the element.

    By passing parameters with .data("key", "value") in the request, we don't need to URL-encode them manually.
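
For comparison, here is a minimal stdlib-only sketch of the manual encoding that .data(...) spares you. The class and method names (QueryStringSketch, buildQuery, newsSearchUrl) are hypothetical, and the page size of 10 used to turn a page index into a "start" offset is an assumption about Google's default, not something jsoup provides.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryStringSketch {

    // Form-encodes a single key or value (space becomes '+', '&' becomes %26, ...).
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Joins the parameters as key=value pairs separated by '&', in insertion order.
    static String buildQuery(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(encode(entry.getKey()) + "=" + encode(entry.getValue()));
        }
        return joiner.toString();
    }

    // Builds a news search URL for a zero-based page index.
    // Assumption: 10 results per page, so "start" is page * 10.
    static String newsSearchUrl(String term, int page) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", term);
        params.put("tbm", "nws");
        params.put("start", String.valueOf(page * 10));
        return "https://www.google.com/search?" + buildQuery(params);
    }

    public static void main(String[] args) {
        System.out.println(newsSearchUrl("stack overflow & jsoup", 1));
    }
}
```

The resulting string could be passed straight to Jsoup.connect(...), but letting jsoup do the encoding via .data(...) is shorter and harder to get wrong.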

    Example code

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // The statements below belong inside a method, e.g. main.
    String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
    String searchTerm = "stackoverflow";
    int numberOfResultpages = 2; // grabs the first two pages of search results
    String searchUrl = "https://www.google.com/search?";

    for (int i = 0; i < numberOfResultpages; i++) {

        try {
            Document doc = Jsoup.connect(searchUrl)
                    .userAgent(userAgent)
                    .data("q", searchTerm)   // jsoup URL-encodes the values
                    .data("tbm", "nws")      // restrict the search to news
                    // "start" is a result offset, not a page index (assuming 10 results per page)
                    .data("start", String.valueOf(i * 10))
                    .referrer("https://www.google.com/")
                    .get();                  // get() already sets the method to GET

            for (Element result : doc.select("#rso > div")) {

                Element h3a = result.select("h3 > a").first();
                Element st = result.select("div.st").first();

                // skip blocks that are not regular results (no headline link or no preview)
                if (h3a == null || st == null) continue;

                String title = h3a.text();
                String url = h3a.attr("href");
                String preview = st.text();

                // just printing out title, link and preview to demonstrate the approach
                System.out.println(title + " -> " + url + "\n\t" + preview);
            }

        } catch (IOException e) {
            // network or HTTP error: report it and continue with the next page
            e.printStackTrace();
        }
    }
    

    Output

    Stack Overflow: Movie Magic -> https://geekdad.com/2016/09/stack-overflow-movie-magic-2/
        I got to visit the set of Kubo and the Two Strings and see some of the amazing work that went into creating the film. But well before the ...
    Will StackOverflow Documentation Realize Its Lofty Goal? -> https://dzone.com/articles/will-stackoverflow-documentation-realize-its-lofty
        With the StackOverflow Documentation project now in beta, how close is it to realizing the lofty goals it has set forth for itself? Can it ever ...
    Stack Overflow: Progress Report -> https://geekdad.com/2016/09/stack-overflow-progress-report/
        Of the books on my list, the only one I totally finished so far is Kidding Ourselves, which I included in this Stack Overflow. And that perhaps is an ...
    ....
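
The code in the question notes that Google may return links in the wrapped form "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>". If the href values you get look like that instead of the direct links shown in the output above, the target can be recovered with a small helper. This is only a sketch under that assumption; the class and method names (RedirectUrlSketch, unwrap) are hypothetical, and it uses nothing beyond the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

class RedirectUrlSketch {

    // Returns the decoded value of the "q" parameter when href points at a
    // "/url?q=..." redirect; returns href unchanged in every other case.
    static String unwrap(String href) {
        URI uri;
        try {
            uri = URI.create(href);
        } catch (IllegalArgumentException e) {
            return href; // not a parseable URI, leave it alone
        }
        String rawQuery = uri.getRawQuery();
        if (!"/url".equals(uri.getPath()) || rawQuery == null) {
            return href; // already a direct link
        }
        // Split the still-encoded query so that an encoded '&' inside the
        // target URL is not mistaken for a parameter separator.
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("q")) {
                return decode(pair.substring(eq + 1));
            }
        }
        return href;
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(unwrap("http://www.google.com/url?q=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1%26c%3D2&sa=U&ei=abc"));
    }
}
```

Unlike the substring(indexOf('='), indexOf('&')) approach in the question, this looks up the q parameter by name, so it does not depend on q being the first parameter and it passes direct links through untouched.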