Jsoup attribute selector returning empty


I am trying to get images from Google:

String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=audi&gws_rd=cr";
 org.jsoup.nodes.Document doc = Jsoup.connect(url).get();
 Elements elements = doc.select("div.isv-r.PNCib.MSM1fd.BUooTd");

The image data is Base64-encoded, so to get the actual image URL I first read the data-id attribute of each result. This part works:

 for (Element element : elements) {
 String id = element.attr("data-id");

I then need to make a new connection to url + "#imgrc=" + id:

org.jsoup.nodes.Document imgdoc = Jsoup.connect(url + "#imgrc=" + id).get();

When I inspect the page in the browser, the data I need is inside <div jsname="CGzTgf">, so I select the same element in Jsoup:

   Elements images = imgdoc.select("div[jsname='CGzTgf']");
   // further steps

But images is always empty, and I am unable to find the error. I do this inside a new thread on Android. Any help will be appreciated.


Solution

  • It turns out that, the way you're doing it, you'll be looking in the wrong place entirely. The URLs are contained in a JavaScript <script> tag included in the response.
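
    One likely reason the second request in the question cannot help: the part of a URL after # (the fragment) is handled on the client side and is not sent to the server, so that request should return the same search page again. Below is a minimal sketch using only java.net.URI. The class name FragmentDemo, the helper requestTarget, and the id abc123 are made up for illustration.

    ```java
    import java.net.URI;

    final class FragmentDemo {

        // Builds the path-plus-query part of the URL, which is what an HTTP
        // client puts on the request line. The fragment is deliberately left out.
        static String requestTarget(URI uri) {
            String query = uri.getRawQuery();
            return uri.getRawPath() + (query == null ? "" : "?" + query);
        }

        public static void main(String[] args) {
            // "abc123" is a hypothetical data-id value.
            URI uri = URI.create("https://www.google.com/search?tbm=isch&q=audi#imgrc=abc123");
            System.out.println("fragment:       " + uri.getFragment());
            System.out.println("request target: " + requestTarget(uri));
        }
    }
    ```

    The request target is identical with or without the #imgrc=... suffix, which is why the data has to be taken from the first response.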

    I've extracted and filtered for the relevant <script> tags (those with a nonce attribute).

    I then filter those tags for one that starts with a specific function name AND contains a generic search string I'm expecting to find (something that won't be in the other <script> tags).

    Next, the value obtained needs to be stripped to get the JSON object, which contains about a hundred thousand arrays. I then navigated this manually to pull out a subset of nodes containing the relevant URL nodes, and filtered that again to get a List<String> of the full URLs.
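
    The filter-and-strip steps above can be tried in isolation on plain strings, without Jsoup or a network call. This is only a sketch: the class name ScriptPayloadDemo and the sample script bodies are made up, and for simplicity it returns the first matching script, whereas the full code below prefers the second match when there are two.

    ```java
    import java.util.Arrays;
    import java.util.List;

    final class ScriptPayloadDemo {

        static final String PREFIX = "AF_initDataCallback(";
        static final String MARKER = "/search?q";

        // Returns the first script body that starts with PREFIX and contains
        // MARKER, or an empty string if none matches.
        static String findPayloadScript(List<String> scriptBodies) {
            for (String body : scriptBodies) {
                if (body.startsWith(PREFIX) && body.contains(MARKER)) {
                    return body;
                }
            }
            return "";
        }

        // Removes the leading "AF_initDataCallback(" and the trailing ");"
        // so that only the object passed to the function remains.
        static String stripWrapper(String script) {
            if (!script.startsWith(PREFIX) || !script.endsWith(");")) {
                throw new IllegalArgumentException("unexpected script format");
            }
            return script.substring(PREFIX.length(), script.length() - 2);
        }

        public static void main(String[] args) {
            // Made-up script bodies standing in for element.data() values.
            List<String> scripts = Arrays.asList(
                    "window.foo = 1;",
                    "AF_initDataCallback({\"key\":\"ds:0\",\"data\":[]});",
                    "AF_initDataCallback({\"key\":\"ds:1\",\"data\":[\"/search?q=audi\"]});");
            System.out.println(stripWrapper(findPayloadScript(scripts)));
        }
    }
    ```

    Checking the prefix and suffix before cutting makes a change in the page format fail loudly instead of producing a truncated string that only breaks later in the JSON step.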

    Finally, I've reused some code from an earlier solution (https://stackoverflow.com/a/63135249/7619034) to download the images.

    You'll also get console output detailing which URL ended up under which file id. Files are named image_[x].jpg regardless of the actual format, so you may need to rework it a little (hint: take the file extension from the URL if one is provided).

    import com.jayway.jsonpath.JsonPath;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;
    
    public class GoogleImageDownloader {
    
        private static final int TIMEOUT = 30000;
        private static final int BUFFER_SIZE = 4096;
    
        public static final String RELEVANT_JSON_START = "AF_initDataCallback(";
        public static final String PARTIAL_GENERIC_SEARCH_QUERY = "/search?q";
    
        public static void main(String[] args) throws IOException {
            String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=audi&gws_rd=cr";
            Document doc = Jsoup.connect(url).get();
    
            // Response with relevant data is in a <script> tag
            Elements elements = doc.select("script[nonce]");
    
            String jsonDataElement = getRelevantScriptTagContainingUrlDataAsJson(elements);
            String jsonData = getJsonData(jsonDataElement);
            List<String> imageUrls = getImageUrls(jsonData);
    
            int fileId = 1;
            for (String urlEntry : imageUrls) {
                try {
                    writeToFile(fileId, makeImageRequest(urlEntry));
                    System.out.println(urlEntry + " : " + fileId);
                    fileId++;
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    
        private static String getRelevantScriptTagContainingUrlDataAsJson(Elements elements) {
            String jsonDataElement = "";
            int count = 0;
            for (Element element : elements) {
                String jsonData = element.data();
                if (jsonData.startsWith(RELEVANT_JSON_START) && jsonData.contains(PARTIAL_GENERIC_SEARCH_QUERY)) {
                    jsonDataElement = jsonData;
                    // IF there are two items in the list, take the 2nd, rather than the first.
                    if (count == 1) {
                        break;
                    }
                    count++;
                }
            }
            return jsonDataElement;
        }
    
        private static String getJsonData(String jsonDataElement) {
            String jsonData = jsonDataElement.substring(RELEVANT_JSON_START.length(), jsonDataElement.length() - 2);
            return jsonData;
        }
    
        private static List<String> getImageUrls(String jsonData) {
            // Reason for doing this in two steps is debugging is much faster on the smaller subset of json data
            String urlArraysList = JsonPath.read(jsonData, "$.data[31][*][12][2][*]").toString();
            List<String> imageUrls = JsonPath.read(urlArraysList, "$.[*][*][3][0]");
            return imageUrls;
        }
    
        private static void writeToFile(int i, HttpURLConnection response) throws IOException {
            // try-with-resources closes both streams even if a read or write fails
            try (InputStream inputStream = response.getInputStream();
                 FileOutputStream outputStream = new FileOutputStream("image_" + i + ".jpg")) {
                int bytesRead;
                byte[] buffer = new byte[BUFFER_SIZE];
                // copy the response body to the file one buffer at a time
                while ((bytesRead = inputStream.read(buffer)) != -1) {
                    outputStream.write(buffer, 0, bytesRead);
                }
            }
    
            System.out.println("File downloaded");
        }
    
        // Could use JSoup here but I'm re-using this from an earlier answer
        private static HttpURLConnection makeImageRequest(String imageUrlString) throws IOException {
            URL imageUrl = new URL(imageUrlString);
            HttpURLConnection response = (HttpURLConnection) imageUrl.openConnection();
            response.setRequestMethod("GET");
            response.setConnectTimeout(TIMEOUT);
            response.setReadTimeout(TIMEOUT);
            response.connect();
            return response;
        }
    }
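
    The hint about file extensions can be sketched as a small helper that derives the extension from the URL path and falls back to jpg when there is none. The class name ImageFileNames, both method names, and the example.com URLs are hypothetical. The extension in a URL is not guaranteed to match the real image format, so treat this as a best-effort guess.

    ```java
    import java.net.URI;
    import java.util.Locale;

    final class ImageFileNames {

        // Returns the lower-cased extension of the last path segment,
        // or the fallback if the URL has no usable extension.
        static String extensionFromUrl(String imageUrl, String fallback) {
            String path;
            try {
                path = URI.create(imageUrl).getPath();
            } catch (IllegalArgumentException e) {
                return fallback; // not a syntactically valid URI
            }
            if (path == null) {
                return fallback;
            }
            String name = path.substring(path.lastIndexOf('/') + 1);
            int dot = name.lastIndexOf('.');
            if (dot < 0 || dot == name.length() - 1) {
                return fallback;
            }
            String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
            // Accept only short alphanumeric extensions such as png or webp.
            return ext.matches("[a-z0-9]{1,5}") ? ext : fallback;
        }

        // Builds a name in the same image_[x] style used by writeToFile.
        static String fileNameFor(int fileId, String imageUrl) {
            return "image_" + fileId + "." + extensionFromUrl(imageUrl, "jpg");
        }

        public static void main(String[] args) {
            System.out.println(fileNameFor(1, "https://example.com/cars/audi.PNG?w=200"));
            System.out.println(fileNameFor(2, "https://example.com/image?id=5"));
        }
    }
    ```

    To use it, writeToFile would need the URL as an extra argument so it can call fileNameFor(i, url) instead of hard-coding ".jpg".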
    

    Partial result I tested with: [screenshot not preserved]

    I've used JsonPath to filter the relevant nodes, which works well when you only care about a small portion of the JSON and don't want to deserialise the whole object. Its navigation style is similar to DOM/XPath/jQuery.

    Apart from JsonPath and Jsoup, the libraries used are bog standard.

    Good Luck!