java, http-redirect, refresh, jsoup, meta

Jsoup meta refresh redirect


I want to get an HTML page from a meta refresh redirect, very similar to the question "Can Jsoup handle meta refresh redirect".

But I can't get it to work. I want to do a search on http://synchronkartei.de. I have the following code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SynchronkarteiScraper {
  public static void main(String[] args) throws Exception{
    Document doc = Jsoup.connect("https://www.synchronkartei.de/search.php")
                                        .data("cat", "2")
                                        .data("search", "Thomas Danneberg")
                                        .data("action", "search")
                                        .followRedirects(true)
                                        .get();
    Elements meta = doc.select("html head meta");                                  
    for (final Element m : meta){
      if (m.attr("http-equiv").contains("refresh")){
        doc = Jsoup.connect(m.baseUri()+m.attr("content").split("=")[1]).get();
      }
    }

    System.out.println(doc.body().toString());
  }
}

This performs the search, which leads to a temporary page that refreshes and then opens the real result page. It is the same as going to http://synchronkartei.de, selecting "Sprecher" from the drop-down box, entering "Thomas Danneberg" in the text field, and hitting Enter.

But even after extracting the refresh URL and making a second connection, I still get the content of the temporary landing page, as the println of the body shows.

So what is going wrong here?

As a note, synchronkartei.de always redirects to HTTPS. Since it uses a certificate from StartCom, Java complains about the certificate path. For the above snippet to work, it is necessary to set the VM parameter -Djavax.net.ssl.trustStore=<path-to-keystore> to a keystore containing the correct certificate.


Solution

  • I have to admit that I am no expert in Jsoup, but I do know some details about the Synchronkartei.

    Deutsche Synchronkartei supports OpenSearch descriptions, with the description document linked at /search.xml. That said, you could also use https://www.synchronkartei.de/search.php?search={searchTerms} to get your search term into the session.

    All you need is the "sid" cookie with the session ID that the Synchronkartei provides. After that, a direct request to https://www.synchronkartei.de/index.php?action=search will return the results, regardless of your referrer.

    What I mean is: first send a request to https://www.synchronkartei.de/search.php?search={searchTerms} or https://www.synchronkartei.de/search.php?cat={Category}&search={searchTerms}&action=search (as you did above). If it returns HTTP status 200, ignore the response body completely, but save the session cookie. After that, send a request to https://www.synchronkartei.de/index.php?action=search with that cookie, which should then return the whole list of results.

    Funzi
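
The two-step flow described in the answer could look roughly like the sketch below. It is not the answerer's code and has not been run against the live site. To stay free of dependencies it uses the JDK's java.net.http client (Java 11+) instead of Jsoup; a CookieManager carries the "sid" cookie from the first request to the second. The class name SynchronkarteiSession and its helper methods are made up for illustration. With Jsoup, the equivalent would be to run the first request via execute(), read Connection.Response.cookies(), and pass that map to the second connection with cookies(...).

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name and structure are illustrative only.
final class SynchronkarteiSession {
    private static final String BASE = "https://www.synchronkartei.de";

    // Step 1 URL: stores the search term in the server-side session.
    static URI buildSearchUri(String category, String searchTerms) {
        String query = "cat=" + URLEncoder.encode(category, StandardCharsets.UTF_8)
                + "&search=" + URLEncoder.encode(searchTerms, StandardCharsets.UTF_8)
                + "&action=search";
        return URI.create(BASE + "/search.php?" + query);
    }

    // Step 2 URL: returns the results for the search stored in the session.
    static URI resultsUri() {
        return URI.create(BASE + "/index.php?action=search");
    }

    static String search(String category, String searchTerms)
            throws IOException, InterruptedException {
        // The CookieManager remembers cookies (including "sid") set by the
        // first response and sends them back with the second request.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        // Step 1: register the search; the response body is discarded.
        HttpResponse<Void> first = client.send(
                HttpRequest.newBuilder(buildSearchUri(category, searchTerms)).GET().build(),
                HttpResponse.BodyHandlers.discarding());
        if (first.statusCode() != 200) {
            throw new IOException("Unexpected status " + first.statusCode());
        }

        // Step 2: fetch the result list using the same session cookie.
        HttpResponse<String> second = client.send(
                HttpRequest.newBuilder(resultsUri()).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        return second.body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(search("2", "Thomas Danneberg"));
    }
}
```

The returned HTML can then be handed to Jsoup.parse(html, baseUri) if you want to keep using selectors on the result page.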