I am trying to get images from Google.
String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=audi&gws_rd=cr";
org.jsoup.nodes.Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("div.isv-r.PNCib.MSM1fd.BUooTd");
The image data is encoded in base64, so to get the actual image URL I first get the data-id, which is set as an attribute. This works:
for (Element element : elements) {
String id = element.attr("data-id");
Then I need to make a new connection with url + "#imgrc=" + id:
org.jsoup.nodes.Document imgdoc = Jsoup.connect(url+"#"+id).get();
Now, when I inspect in the browser, my required data is present inside <div jsname="CGzTgf">, so I do the same in Jsoup:
Elements images = imgdoc.select("div[jsname='CGzTgf']");
// further steps
But images is always empty, and I am unable to find the error. I run this inside a new thread on Android. Any help will be appreciated.
Turns out that, the way you're doing it, you'll be looking in the wrong place entirely. The URLs are contained within a JavaScript <script> tag included in the response.
I've extracted and filtered for the relevant <script> tags (the ones containing a nonce attribute).
I then filter those tags for the one containing a specific function name AND a generic search string I'm expecting to find (something that won't be in the other <script> tags).
Next, the value obtained needs to be stripped to get the JSON object, which contains about a hundred thousand arrays. I've then navigated this (manually) to pull out a subset of nodes containing the relevant URL nodes, and filtered that again into a List<String> of the full URLs.
Finally, I've reused some code from an earlier solution (https://stackoverflow.com/a/63135249/7619034) that does something similar, to download the images.
You'll also get some console output detailing which URL ended up in which file id. Files are labelled image_[x].jpg regardless of the actual format, so you may need to rework that a little (hint: take the file extension from the URL if one is provided; a rough sketch of that follows).
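For that hint, a minimal sketch (the helper name and the fallback are my own, not part of the listing below) that derives the extension from the URL path could look like this:

private static String fileExtensionFrom(String imageUrlString) {
    // Take everything after the last '.' in the URL path, e.g. ".png" or ".jpg"
    String path = java.net.URI.create(imageUrlString).getPath();
    int dot = path.lastIndexOf('.');
    if (dot > path.lastIndexOf('/') && dot < path.length() - 1) {
        return path.substring(dot);
    }
    // No usable extension in the URL, so fall back to ".jpg"
    return ".jpg";
}

You could then use its return value in place of the hard-coded ".jpg" in writeToFile(...).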
import com.jayway.jsonpath.JsonPath;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
public class GoogleImageDownloader {

    private static final int TIMEOUT = 30000;
    private static final int BUFFER_SIZE = 4096;
    public static final String RELEVANT_JSON_START = "AF_initDataCallback(";
    public static final String PARTIAL_GENERIC_SEARCH_QUERY = "/search?q";

    public static void main(String[] args) throws IOException {
        String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=audi&gws_rd=cr";
        Document doc = Jsoup.connect(url).get();

        // Response with relevant data is in a <script> tag
        Elements elements = doc.select("script[nonce]");
        String jsonDataElement = getRelevantScriptTagContainingUrlDataAsJson(elements);
        String jsonData = getJsonData(jsonDataElement);
        List<String> imageUrls = getImageUrls(jsonData);

        int fileId = 1;
        for (String urlEntry : imageUrls) {
            try {
                writeToFile(fileId, makeImageRequest(urlEntry));
                System.out.println(urlEntry + " : " + fileId);
                fileId++;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    private static String getRelevantScriptTagContainingUrlDataAsJson(Elements elements) {
        String jsonDataElement = "";
        int count = 0;
        for (Element element : elements) {
            String jsonData = element.data();
            if (jsonData.startsWith(RELEVANT_JSON_START) && jsonData.contains(PARTIAL_GENERIC_SEARCH_QUERY)) {
                jsonDataElement = jsonData;
                // If there are two items in the list, take the 2nd, rather than the first.
                if (count == 1) {
                    break;
                }
                count++;
            }
        }
        return jsonDataElement;
    }

    private static String getJsonData(String jsonDataElement) {
        String jsonData = jsonDataElement.substring(RELEVANT_JSON_START.length(), jsonDataElement.length() - 2);
        return jsonData;
    }

    private static List<String> getImageUrls(String jsonData) {
        // Reason for doing this in two steps is debugging is much faster on the smaller subset of json data
        String urlArraysList = JsonPath.read(jsonData, "$.data[31][*][12][2][*]").toString();
        List<String> imageUrls = JsonPath.read(urlArraysList, "$.[*][*][3][0]");
        return imageUrls;
    }

    private static void writeToFile(int i, HttpURLConnection response) throws IOException {
        // opens input stream from the HTTP connection
        InputStream inputStream = response.getInputStream();

        // opens an output stream to save into file
        FileOutputStream outputStream = new FileOutputStream("image_" + i + ".jpg");

        int bytesRead = -1;
        byte[] buffer = new byte[BUFFER_SIZE];
        while ((bytesRead = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, bytesRead);
        }

        outputStream.close();
        inputStream.close();
        System.out.println("File downloaded");
    }

    // Could use JSoup here but I'm re-using this from an earlier answer
    private static HttpURLConnection makeImageRequest(String imageUrlString) throws IOException {
        URL imageUrl = new URL(imageUrlString);
        HttpURLConnection response = (HttpURLConnection) imageUrl.openConnection();
        response.setRequestMethod("GET");
        response.setConnectTimeout(TIMEOUT);
        response.setReadTimeout(TIMEOUT);
        response.connect();
        return response;
    }
}
Partial Result I tested with:
I've used JsonPath for filtering the relevant nodes, which is good when you only care about a small portion of the JSON and don't want to deserialise the whole object. It follows a navigation style similar to DOM/XPath/jQuery navigation.
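For illustration, here's a tiny, made-up example of that navigation style (the JSON and path are invented and have nothing to do with Google's actual structure):

String json = "{\"items\": [{\"name\": \"a\"}, {\"name\": \"b\", \"url\": \"https://example.com/x.jpg\"}]}";
// Step into the second element of "items" and read its "url" field, much like an XPath step
String url = JsonPath.read(json, "$.items[1].url"); // -> https://example.com/x.jpg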
Apart from this one library and Jsoup, the libraries used are very bog standard.
Good Luck!