Java scrape website with login required using Jsoup


I'd like to print some data (div elements with class="news_article") from streetinsider.com. I created an account, and I need to log in to access that data.

Can anyone explain why this code is not working? I've tried many things, but nothing works.

public static final String SPLIT_INTERNET_URL = "http://www.streetinsider.com/Special+Dividends?offset=55";
public static final String SPLIT_LOGIN = "https://www.streetinsider.com/login.php";

/**
 * @param args the command line arguments
 * @throws java.io.FileNotFoundException
 * @throws java.io.UnsupportedEncodingException
 * @throws java.text.ParseException
 * @throws java.lang.ClassNotFoundException
 */
public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException, IOException, ParseException, ClassNotFoundException {
    // TODO code application logic here
    Response res = Jsoup.connect(SPLIT_LOGIN)
            .data("loginemail", "XXXXX", "password", "XXXX")
            .method(Method.POST)
            .execute();
    Document doc = res.parse();

    Map<String, String> cookies = res.cookies();

    Document pageWhenAlreadyLoggedIn = Jsoup.connect(SPLIT_INTERNET_URL).cookies(cookies).get();
    Elements elems = pageWhenAlreadyLoggedIn.select("div[class=news_article]");
    for (Element elem : elems) {
        System.out.println(elem);
    }
}

Solution

  • Your code doesn't log you in to the website. Try the code below instead.

    To log in to the website:

    Connection.Response res = Jsoup.connect(SPLIT_LOGIN)
                .data("action", "account", 
                    "redirect", "account_home.php?",
                    "radiobutton", "old", 
                    "loginemail", "XXXXX",
                    "password", "XXXXX", 
                    "LoginChoice", "Sign In to Secure Area")
                .method(Connection.Method.POST)
                .followRedirects(true)
                .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36")
                .execute();
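
    The pairs passed to .data() above are sent together as one URL-encoded POST body. As a rough illustration of what that body looks like, here is a minimal sketch using only the standard library. The class and method names (LoginFormBody, loginFields, encode) are hypothetical and not part of Jsoup; the field names are simply the ones from the snippet above, and the values are placeholders:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Jsoup: shows roughly what a form POST body looks like.
final class LoginFormBody {

    // Field names copied from the login snippet above; email and password are placeholders.
    static Map<String, String> loginFields(String email, String password) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("action", "account");
        fields.put("redirect", "account_home.php?");
        fields.put("radiobutton", "old");
        fields.put("loginemail", email);
        fields.put("password", password);
        fields.put("LoginChoice", "Sign In to Secure Area");
        return fields;
    }

    // Joins the fields as key=value pairs separated by '&', URL-encoding each part.
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                body.add(URLEncoder.encode(field.getKey(), "UTF-8")
                        + "=" + URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException("UTF-8 is always available", e);
        }
        return body.toString();
    }
}
```

    With Jsoup itself, the same map can be handed over in one call through Connection.data(Map<String, String>) instead of listing the pairs inline.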
    

    You are now logged in. However, the website seems to detect whether you are already logged in from another browser or connection, and asks you to terminate that session first. The code below ends the prior session:

    Connection.Response res2 = Jsoup.connect("http://www.streetinsider.com/login_duplicate.php")
                .data("ok", "End Prior Session")
                .method(Connection.Method.POST)
                .cookies(res.cookies())
                .followRedirects(true)
                .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36")
                .execute();
    

    Now res2 contains the home page of your account, and you can proceed to whatever page you want. For more information on how to log in to a website with Jsoup, take a look at the following tutorial:

    How to login to a website with Jsoup
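
    When requesting further pages, it may be safest to send the cookies from both responses, since either one could have set part of the session (this is an assumption; the site may only need one of them). A minimal sketch of merging the two cookie maps, using only the standard library. SessionCookies and merge are hypothetical names, not part of Jsoup:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Jsoup.
final class SessionCookies {

    // Combines cookie maps in order; a later map overrides an earlier one on the same name.
    @SafeVarargs
    static Map<String, String> merge(Map<String, String>... cookieMaps) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> cookies : cookieMaps) {
            merged.putAll(cookies);
        }
        return merged;
    }
}
```

    With the responses above, this could be used as Jsoup.connect(SPLIT_INTERNET_URL).cookies(SessionCookies.merge(res.cookies(), res2.cookies())).get(), after which select("div.news_article") on the returned document would pick out the articles, assuming the page uses that class as in the question.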