java eclipse web-scraping jsoup

I cannot open a link with jsoup


I am trying to get the descriptions of grants from the Carnegie Foundation, and to do so I need to open links and fetch each one as a Document. The links open fine in a browser, but not when I call Jsoup.connect(url).get() from my code in Eclipse. My code works with other links, just not with this specific type. Is there a workaround? One of the links is "https://www.carnegie.org/grants/grants-database/grant/680882743.0/".

try {
    currentDoc = Jsoup.connect(url).get();
} catch (IOException e) {
    // Pass the original exception along as the cause so the real failure stays visible
    throw new IllegalArgumentException("URL cannot be reached", e);
} catch (Exception e) {
    throw new RuntimeException(e);
}

Solution

  • The link you are trying to access returns a JSON document that contains HTML content, unlike regular pages, which return an HTML document. Jsoup.connect(url).get() expects an HTML document and, by default, rejects responses whose content type is not text or XML based.

    In order to handle this scenario you need to:

    1. Retrieve the JSON document
    2. Extract the HTML content from the result JSON property
    3. Parse the HTML content using Jsoup.parse

    You may want to treat the HTML content as a fragment rather than a full document by using Jsoup.parse(htmlContent, "", Parser.xmlParser()), which parses the markup without wrapping it in html, head and body elements.
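Step 2 is the part Jsoup does not cover, so here is a minimal sketch of it. The JDK has no built-in JSON parser, and a real project would normally use a JSON library for this. The class JsonResultExtractor and the method extractString are hypothetical names introduced only for illustration, and the extractor is deliberately naive: it looks for the first property with the given name and does not validate the JSON or track nesting.

```java
final class JsonResultExtractor {

    /**
     * Returns the decoded string value of the first property called name.
     * Illustration only: no validation of the JSON, no tracking of nesting.
     */
    static String extractString(String json, String name) {
        String key = "\"" + name + "\"";
        int i = json.indexOf(key);
        if (i < 0) {
            throw new IllegalArgumentException("No property named " + name);
        }
        i = skipWhitespace(json, i + key.length());
        if (i >= json.length() || json.charAt(i) != ':') {
            throw new IllegalArgumentException(name + " is not used as a property name");
        }
        i = skipWhitespace(json, i + 1);
        if (i >= json.length() || json.charAt(i) != '"') {
            throw new IllegalArgumentException(name + " does not hold a string");
        }
        i++; // move past the opening quote of the value

        StringBuilder out = new StringBuilder();
        while (i < json.length()) {
            char c = json.charAt(i++);
            if (c == '"') {
                return out.toString(); // unescaped quote closes the value
            }
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i >= json.length()) {
                break; // dangling backslash at end of input
            }
            char escaped = json.charAt(i++);
            switch (escaped) {
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                case 't': out.append('\t'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    if (i + 4 > json.length()) {
                        throw new IllegalArgumentException("Truncated \\u escape");
                    }
                    out.append((char) Integer.parseInt(json.substring(i, i + 4), 16));
                    i += 4;
                    break;
                default:
                    out.append(escaped); // covers \" \\ and \/
            }
        }
        throw new IllegalArgumentException("Unterminated string value for " + name);
    }

    private static int skipWhitespace(String s, int i) {
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return i;
    }
}
```

With Jsoup on the classpath, the raw body for step 1 could be fetched with Jsoup.connect(url).ignoreContentType(true).execute().body(), and the string returned by extractString(body, "result") can then be handed to Jsoup.parse for step 3.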

    If you're traversing a website and need to write code that can handle both HTML and JSON documents, I suggest the following workflow:

    1. Use URLConnection to retrieve the data.
    2. Check the content-type header in the response.
    3. If the Content-Type is application/json, extract the HTML content from the result property of the response payload; otherwise, assume the entire response payload is HTML.
    4. Parse the result from the previous step using Jsoup.parse
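Steps 1 to 3 of this workflow could be sketched roughly as follows. HtmlOrJsonFetcher and its methods are hypothetical names, the body is assumed to be UTF-8, and jsonToHtml stands for whatever code pulls the HTML out of the JSON payload (for example with a JSON library); none of this comes from the site itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.function.UnaryOperator;

final class HtmlOrJsonFetcher {

    /** True when a Content-Type header value names JSON, ignoring parameters such as charset. */
    static boolean isJson(String contentType) {
        if (contentType == null) {
            return false; // header missing: fall back to treating the body as HTML
        }
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/json");
    }

    /** Step 3: unwrap the HTML from a JSON payload, or pass an HTML payload through unchanged. */
    static String selectHtml(String contentType, String body, UnaryOperator<String> jsonToHtml) {
        return isJson(contentType) ? jsonToHtml.apply(body) : body;
    }

    /** Steps 1 to 3: download the payload, inspect its Content-Type, return the HTML to parse. */
    static String fetchHtml(String url, UnaryOperator<String> jsonToHtml) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        try (InputStream in = connection.getInputStream()) {
            // Assumption: the payload is UTF-8 encoded
            String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            return selectHtml(connection.getContentType(), body, jsonToHtml);
        }
    }
}
```

For step 4, the string returned by fetchHtml is what gets passed to Jsoup.parse. Keeping the content-type decision in selectHtml, separate from the download, makes it possible to check that logic without touching the network.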

    Note that this approach assumes that every JSON document has a property called result containing the HTML content. This may be enough for your specific use case, but it does not hold for JSON documents in general.