Tags: android, android-asynctask, jsoup, screen-scraping

Scraping the first page of Google search results using Jsoup with AsyncTask fails?


I've been using Jsoup to fetch certain words from Google search results, but as far as I can tell it fails during the Jsoup query step.

Execution reaches the doInBackground method successfully, but the title and body of each search result are never printed.

My guess is that the list returned by doc.select (links) is empty, which points to a problem with the selector syntax.

value is the search keyword; in my case it's a barcode, and searching for it actually works. Here's the link

Here is the async call from another class:

String url = "https://www.google.com/search?q=";

if (!value.isEmpty())
{
    // "num" needs "=" to carry a value
    url = url + value + " price" + "&num=10";
    Scrape_Asynctasks task = new Scrape_Asynctasks();
    task.execute(url);
}
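
For reference, here is a minimal sketch of how the same search URL could be built with the query URL-encoded, so that the space in " price" does not end up raw in the URL. It assumes the num parameter is meant to carry the value 10; the class name SearchUrlBuilder and the method buildSearchUrl are hypothetical and not part of the original code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original code.
public class SearchUrlBuilder {

    // Builds the search URL for a keyword, encoding the query part.
    // Assumption: "num" is meant to be passed as num=10.
    static String buildSearchUrl(String value) {
        try {
            // URLEncoder produces form encoding, so the space becomes "+".
            String query = URLEncoder.encode(value + " price", "UTF-8");
            return "https://www.google.com/search?q=" + query + "&num=10";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this branch should not be reached.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("12345"));
    }
}
```

The resulting string can be passed to task.execute(...) in place of the concatenated url.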

and here is the async task itself:

public class Scrape_Asynctasks extends AsyncTask<String, Integer, String>
{
    @Override
    protected void onPreExecute() {
        super.onPreExecute();
    }

    @Override
    protected String doInBackground(String... strings) {
        try
        {
            Log.i("IN", "ASYNC");

            final Document doc = Jsoup
                .connect(strings[0])
                .userAgent("Jsoup client")
                .timeout(5000).get();

            Elements links = doc.select("li[class=g]");

            for (Element link : links)
            {
                Elements titles = link.select("h3[class=r]");
                String title = titles.text();

                Elements bodies = link.select("span[class=st]");
                String body = bodies.text();

                Log.i("Title: ", title + "\n");
                Log.i("Body: ", body);
            }
        }

        catch (IOException e)
        {
            Log.i("ERROR", "ASYNC");
        }
        return "finished";
    }

    @Override
    protected void onProgressUpdate(Integer... values) {
        super.onProgressUpdate(values);
    }

    @Override
    protected void onPostExecute(String s) {
        super.onPostExecute(s);
    }
}

Solution

    1. Don't use "Jsoup client" as your user agent string; some sites (including Google) don't like it. Use the same string as your browser, e.g. "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0".
    2. Your first selector should be .g: Elements links = doc.select(".g");
    3. The site uses JavaScript, so you will not get all the results that you see in your browser.
      You can disable JS in your browser to see the difference.
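
Points 1 and 2 can be sketched as follows. This is an illustration under assumptions, not a guaranteed working scraper. The sample HTML is made up for the demonstration and is not Google's real markup, which differs and changes over time. The class name GoogleResultSketch and its methods are hypothetical. The h3 and span class names in the sample are copied from the question and may not match the live page. The sketch shows why li[class=g] can come back empty: it requires an li element whose class attribute is exactly "g", while .g matches any element that has g among its classes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

// Hypothetical class for illustration; requires the Jsoup library on the classpath.
public class GoogleResultSketch {

    // Browser-like user agent string, taken from point 1 above.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0";

    // Fetches a page with the browser-like user agent (network call, not used in main).
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url).userAgent(USER_AGENT).timeout(5000).get();
    }

    // Made-up sample markup (assumption): result blocks are divs, one with an extra class.
    static Document sample() {
        String html =
            "<div class=\"g\"><h3 class=\"r\">First</h3><span class=\"st\">Body one</span></div>"
          + "<div class=\"g card\"><h3 class=\"r\">Second</h3><span class=\"st\">Body two</span></div>";
        return Jsoup.parse(html);
    }

    // Counts the elements matching a CSS selector.
    static int count(Document doc, String selector) {
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        Document doc = sample();
        // The selector from the question finds nothing in this sample: there is no li element.
        System.out.println("li[class=g] matches: " + count(doc, "li[class=g]"));
        // The class selector finds both blocks, including the one with an extra class.
        System.out.println(".g matches: " + count(doc, ".g"));
    }
}
```

In doInBackground, fetch(strings[0]) would replace the Jsoup.connect(...) chain from the question; the rest of the loop can stay as it is, as long as the inner selectors still match the page that is actually returned.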