I want to fetch an HTML page that sits behind a meta refresh redirect, very similar to the question "Can jsoup handle meta refresh redirect?".
But I can't get it to work. I want to do a search on http://synchronkartei.de. I have the following code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SynchronkarteiScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.synchronkartei.de/search.php")
                .data("cat", "2")
                .data("search", "Thomas Danneberg")
                .data("action", "search")
                .followRedirects(true)
                .get();
        Elements meta = doc.select("html head meta");
        for (final Element m : meta) {
            if (m.attr("http-equiv").contains("refresh")) {
                doc = Jsoup.connect(m.baseUri() + m.attr("content").split("=")[1]).get();
            }
        }
        System.out.println(doc.body().toString());
    }
}
This performs the search, which leads to a temporary page that refreshes itself and then opens the real result page. It is the same as going to http://synchronkartei.de, selecting "Sprecher" from the drop-down box, entering "Thomas Danneberg" into the text field and hitting Enter.
But even after extracting the refresh URL and doing a second connect, I still get the content of the temporary landing page, as can be seen in the println of the body.
So what is going wrong here?
As a note, the site synchronkartei.de always redirects to HTTPS. Since it uses a certificate from StartCom, Java complains about the certificate path. To make the above code snippet work, it is necessary to pass the VM parameter -Djavax.net.ssl.trustStore=<path-to-keystore> pointing to a keystore that contains the correct certificate.
I have to admit that I am no expert in jsoup, but I do know some details about the Synchronkartei.
Deutsche Synchronkartei provides an OpenSearch description, which is linked at /search.xml. That said, you could also use https://www.synchronkartei.de/search.php?search={searchTerms}
to get your search term into the session.
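As a small illustration of filling in those URL templates, here is a minimal sketch using only the JDK (Java 10 or later). The class and method names are made up for this example; only the URL patterns come from the text. The point is mainly that the search term has to be URL-encoded before it goes into the query string.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup or of the Synchronkartei site.
class SynchronkarteiUrls {
    static final String BASE = "https://www.synchronkartei.de/search.php";

    // URL-encode a single query value (a space becomes '+').
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Template: search.php?search={searchTerms}
    static String simpleSearch(String searchTerms) {
        return BASE + "?search=" + encode(searchTerms);
    }

    // Template: search.php?cat={Category}&search={searchTerms}&action=search
    static String categorySearch(String category, String searchTerms) {
        return BASE + "?cat=" + encode(category)
                + "&search=" + encode(searchTerms)
                + "&action=search";
    }

    public static void main(String[] args) {
        // prints https://www.synchronkartei.de/search.php?search=Thomas+Danneberg
        System.out.println(simpleSearch("Thomas Danneberg"));
        // prints https://www.synchronkartei.de/search.php?cat=2&search=Thomas+Danneberg&action=search
        System.out.println(categorySearch("2", "Thomas Danneberg"));
    }
}
```

With jsoup, the .data(...) calls from the question should already take care of this encoding, so the helper is mostly useful when building the URL by hand.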
All you need is a cookie "sid" containing the session ID that the Synchronkartei hands out. After that, a direct request to https://www.synchronkartei.de/index.php?action=search
will return the results, regardless of your referrer.
What I mean is: first send a request to https://www.synchronkartei.de/search.php?search={searchTerms}
or https://www.synchronkartei.de/search.php?cat={Category}&search={searchTerms}&action=search
(as you did above) and ignore the response body completely if the HTTP status is 200, but save the session cookie. After that, send a request to https://www.synchronkartei.de/index.php?action=search
which should then give you the whole list of results.
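The two steps above can be sketched roughly as follows. Since I am not sure about the jsoup details, this sketch uses the JDK's own java.net.http client (Java 11 or later) instead of jsoup. The class and method names are invented for the example, and I am assuming that the session cookie is set on the first response and is all the second request needs. The same truststore VM parameter from the question should still be required if Java rejects the certificate.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of the "search first, then fetch results" flow.
class SynchronkarteiSession {
    static final String RESULTS_URL =
            "https://www.synchronkartei.de/index.php?action=search";

    // A client with a cookie store: cookies set by the first response
    // (such as "sid") are sent again automatically on later requests.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Build a plain GET request for an already encoded URL.
    static HttpRequest get(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // searchUrl is one of the search.php URLs, with the term already URL-encoded.
    static String fetchResults(HttpClient client, String searchUrl)
            throws IOException, InterruptedException {
        // Step 1: put the search term into the session; the body is discarded.
        HttpResponse<Void> first =
                client.send(get(searchUrl), HttpResponse.BodyHandlers.discarding());
        if (first.statusCode() != 200) {
            throw new IOException("Search request failed: HTTP " + first.statusCode());
        }
        // Step 2: ask for the result list, using the same session cookie.
        HttpResponse<String> results =
                client.send(get(RESULTS_URL), HttpResponse.BodyHandlers.ofString());
        return results.body();
    }
}
```

Calling fetchResults(SynchronkarteiSession.newClient(), "https://www.synchronkartei.de/search.php?cat=2&search=Thomas+Danneberg&action=search") would then return the HTML of the result page as a string, which can be handed to Jsoup.parse(html, baseUri) for the usual selector work. If you prefer to stay within jsoup, the equivalent should be to run the first request with execute(), read the cookies from the Connection.Response via cookies(), and pass them to the second connection with cookies(Map).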
Funzi