I am trying to get the descriptions of grants from the Carnegie Foundation, and to do so I need to open links to get the Document. This works in browsers, but not when I use Jsoup.connect(url).get() in Eclipse. My code works with other links, but not with this type specifically. Is there a workaround? One of the links is "https://www.carnegie.org/grants/grants-database/grant/680882743.0/".
try {
    currentDoc = Jsoup.connect(url).get();
} catch (IOException e) {
    // Keep the original exception as the cause so the real failure stays visible.
    throw new IllegalArgumentException("URL cannot be reached", e);
} catch (Exception e) {
    throw new RuntimeException(e);
}
The link you are trying to access returns a JSON document that contains HTML content. This is different from regular pages, which return an HTML document. Jsoup.connect expects an HTML document.
In order to handle this scenario you need to:
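1. Retrieve the JSON document.
2. Extract the HTML content from the result JSON property.
3. Parse the HTML content using Jsoup.parse.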
You may want to treat the HTML content as a fragment, rather than a document, by using Jsoup.parse(htmlContent, "", Parser.xmlParser()).
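As a rough illustration of the extraction step, here is a minimal sketch that pulls the string value of a result property out of a JSON payload using only the JDK. The class and method names (ResultHtmlExtractor, extractResult) are made up for this example, and the textual scan is a deliberate simplification: it is not a full JSON parser and would be fooled by a payload where "result" also appears as a value. In real code a JSON library would be the safer choice.

```java
import java.util.Optional;

/** Hypothetical helper: extracts the string value of a "result" property from a JSON payload. */
class ResultHtmlExtractor {

    /**
     * Returns the unescaped value of the first "result" string property, if any.
     * Simplification: the key is located by a plain text search, not by real JSON parsing.
     */
    public static Optional<String> extractResult(String json) {
        String key = "\"result\"";
        int keyPos = json.indexOf(key);
        if (keyPos < 0) return Optional.empty();
        int colon = json.indexOf(':', keyPos + key.length());
        if (colon < 0) return Optional.empty();
        int i = colon + 1;
        while (i < json.length() && Character.isWhitespace(json.charAt(i))) i++;
        if (i >= json.length() || json.charAt(i) != '"') return Optional.empty();
        i++; // skip the opening quote of the value
        StringBuilder out = new StringBuilder();
        while (i < json.length()) {
            char c = json.charAt(i++);
            if (c == '"') return Optional.of(out.toString()); // closing quote
            if (c != '\\') { out.append(c); continue; }
            if (i >= json.length()) break;
            char esc = json.charAt(i++);
            switch (esc) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u': // four hex digits, e.g. \u003c
                    if (i + 4 > json.length()) return Optional.empty();
                    out.append((char) Integer.parseInt(json.substring(i, i + 4), 16));
                    i += 4;
                    break;
                default: out.append(esc); // covers \" \\ and \/
            }
        }
        return Optional.empty(); // the string value was never closed
    }

    public static void main(String[] args) {
        String json = "{\"result\": \"<div class=\\\"grant\\\">Hello<\\/div>\"}";
        System.out.println(extractResult(json).orElse("(no result property)"));
    }
}
```

With Jsoup on the classpath, the extracted string could then be handed to Jsoup.parse, either as a full document or with Parser.xmlParser() as described above.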
If you're traversing a website and need to write code that can handle both HTML and JSON documents, I suggest the following workflow:
1. Use URLConnection to retrieve the data.
2. Check the content-type header in the response.
3. If it is application/json, extract the HTML content from the result property in the response payload; otherwise assume the entire response payload is HTML.
4. Parse the HTML using Jsoup.parse.
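The workflow above might be sketched roughly as follows, using only the JDK. The names PayloadFetcher, isJson, toHtml and fetchHtml are invented for this example. The jsonToHtml parameter is a placeholder for whatever step reads the result property from the JSON (ideally backed by a JSON library), and the sketch assumes the response body is UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.function.Function;

/** Hypothetical helper: returns HTML whether the server answers with HTML or with JSON wrapping HTML. */
class PayloadFetcher {

    /** True when the Content-Type header names JSON, ignoring case and parameters such as charset. */
    public static boolean isJson(String contentType) {
        if (contentType == null) return false;
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/json");
    }

    /** Uses jsonToHtml for JSON payloads; otherwise treats the whole body as HTML. */
    public static String toHtml(String contentType, String body, Function<String, String> jsonToHtml) {
        return isJson(contentType) ? jsonToHtml.apply(body) : body;
    }

    /** Fetches the URL with URLConnection and returns the HTML to parse. */
    public static String fetchHtml(String url, Function<String, String> jsonToHtml) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        try (InputStream in = connection.getInputStream()) {
            String contentType = connection.getContentType();
            // Assumption: the body is UTF-8; a fuller version would honour the header's charset.
            String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            return toHtml(contentType, body, jsonToHtml);
        }
    }
}
```

The string returned by fetchHtml is what would be passed to Jsoup.parse in the final step. Keeping the content-type decision in toHtml, separate from the network call, makes that part easy to check without hitting a real server.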
Note that this approach assumes that every JSON document has a property called result with the HTML content. This may be enough for your specific use case, but it is definitely not a valid assumption for all JSON documents out there.