Tags: cookies, https, jsoup, screen-scraping

jsoup does not send cookies from previous requests - bug?


I am scraping the web pages of my bank account. All requests go to the same domain. I started like this: res = Jsoup.connect().cookies(res.cookies()) in every request except the first. Cookies should be reused, and some are added between requests. There are both POST and GET requests, and the user agent and some headers are set.

I was getting a 401 error, which indicates an authentication problem. Fiddler showed that jsoup is not sending any cookies in the last request. There is no sign that the server asks for any cookies to be deleted, and the website works fine in a browser, so I assumed the problem was on my side.

Surprisingly, when I save the cookies to a map and attach them to this request, everything works. I cannot post the exact data publicly since it is my bank account, but I can provide the cookies and captured network packets to a developer.

Is it a bug? Here's my code:

import java.io.IOException;
import java.util.Map;

import org.jsoup.Connection.Method;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;



public class Test {

/**
 * @param args unused
 * @throws IOException if any of the requests fails
 */
public static void main(String[] args) throws IOException {


    String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1";


    //get login page
    Response res = Jsoup
        .connect("https://example.com/")
        .userAgent(userAgent)
        .execute();




    //send login
    res = Jsoup
        .connect("https://example.com/login")
        .userAgent(userAgent)
        .cookies(res.cookies())
        .data("redirect", "/")
        .data("login", "1234")
        .method(Method.POST)
        .execute();

    //System.out.print(res.body());



    //send password
    res = Jsoup
        .connect("https://example.com/login")
        .userAgent(userAgent)
        .cookies(res.cookies())
        .data("redirect", "/")
        .data("user", "1234")
        .data("password", "1234")
        .method(Method.POST)
        .execute();

    //System.out.print(res.body());







    Map<String, String> cookies = res.cookies();

    //json
    //here cookies are sent properly
    res = Jsoup
        .connect("https://example.com/0/0/list.json?d=1451669517333")
        .userAgent(userAgent)
        .cookies(res.cookies())
        .method(Method.GET)
        .ignoreContentType(true)
        .execute();

    System.out.print(res.body());


    //json      
    //the cookie problem occurs here - the fix is to use the Map of cookies saved above
    res = Jsoup
        .connect("https://example.com/ord/0/0?a=23000&d=1451669539678")
        .userAgent(userAgent)
        .cookies(cookies)
        .header("Host", "example.com")
        .header("Connection", "keep-alive")
        .header("Accept", "application/json, text/plain, */*")
        .header("X-Requested-With", "XMLHttpRequest")
        .header("Referer", "https://example.com/")
        .header("Accept-Encoding", "gzip, deflate, lzma, sdch")
        .header("Accept-Language", "pl,en-US;q=0.8,en;q=0.6,de;q=0.4")
        .method(Method.GET)
        .ignoreContentType(true)
        .execute();

    System.out.print(res.body());

}

}

Solution

  • It seems that the second-to-last response (the list.json request) does not return any cookies, so you can't use that response as the source of cookies for the final request. jsoup does not automatically handle cookies for you: in each request you need to specify the cookies to send along, as you do. But you also overwrite the variable res each time with a new response, and res.cookies() holds only the cookies received in that response. If you do not save the cookies in a map, the old cookies are lost together with the old responses. So your approach with the map is perfectly valid, and I would keep using this pattern.

    If you want more automatic cookie management, I would suggest using the Apache HttpClient library.
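
The map pattern can be taken one step further by merging the cookies of every response into a single map, so that a response which sets no cookies cannot wipe out the session. Below is a minimal sketch that uses only the standard library. SessionCookies, merge and asMap are hypothetical names made up for illustration (they are not part of jsoup), and the cookie names and values are invented stand-ins for what res.cookies() might return.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of jsoup: collects cookies across several responses. */
class SessionCookies {
    private final Map<String, String> cookies = new LinkedHashMap<>();

    /** Adds the cookies of one response; a newer value replaces an older one with the same name. */
    void merge(Map<String, String> responseCookies) {
        cookies.putAll(responseCookies);
    }

    /** Returns a copy of everything collected so far, e.g. to pass to Connection.cookies(...). */
    Map<String, String> asMap() {
        return new LinkedHashMap<>(cookies);
    }

    public static void main(String[] args) {
        SessionCookies jar = new SessionCookies();

        // Invented stand-ins for res.cookies() after each request.
        jar.merge(Map.of("JSESSIONID", "abc")); // login page
        jar.merge(Map.of("auth", "xyz"));       // password step
        jar.merge(Map.of());                    // JSON response that sets no cookies

        // The earlier cookies survive the empty response.
        System.out.println(jar.asMap()); // prints {JSESSIONID=abc, auth=xyz}
    }
}
```

With jsoup, each request would then call .cookies(jar.asMap()) before execute() and jar.merge(res.cookies()) afterwards. Newer jsoup releases also offer Jsoup.newSession(), which keeps a cookie store across requests made from the same session; whether it is available depends on the jsoup version you use.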