java http curl jsoup

HTTP CURL works - Java Jsoup doesn't


I'm trying to scrape some chat messages from a site (https://bs.to), but I have to log in first via HTTP POST. With curl my code works fine:

curl -v -X POST ^
-H "Cookie: __bsduid=226mq3kt8oafl5f1le1hv3ognl; " ^
-d "login[user]=RainbowSimon&login[pass]=MY_PASSWORD&security_token=687f7de7247f9a95f7fccc6a" "https://bs.to" ^
--output "out.txt"

But when I ported it to Java with Jsoup, I get status code 200 and an HTML document, yet I'm not logged in:

Connection.Response loggedIn;
loggedIn = Jsoup.connect("http://bs.to")
    .cookie("__bsduid", cookieUID)
    .data("login[user]", loginUserName)
    .data("login[pass]", loginUserPassword)
    .data("security_token", securityTokenForm)
    .method(Method.POST)
    .execute();

System.out.println(loggedIn.statusCode());
System.out.println(loggedIn.parse());

I even retrieved the security_token and the cookie from the Java application and used them with curl, and that worked too.

Does anyone see the mistake I made when porting this to Java?


Solution

  • You get different responses because you send different requests. The main difference here is the headers.

    Web browsers and curl automatically set some basic request headers for you, but Jsoup does not send the same ones. You have to add them to the connection explicitly. You're using curl with -v, so they are already visible:

    > POST / HTTP/2
    > Host: bs.to
    > User-Agent: curl/7.60.0
    > Accept: */*
    > Cookie: __bsduid=226mq3kt8oafl5f1le1hv3ognl;
    > Content-Length: 88
    > Content-Type: application/x-www-form-urlencoded
    

    Jsoup won't send the same User-Agent, Accept and Content-Type headers on its own. Some servers rely on them to tell real web browsers from crawlers. Try setting them to exactly the values above with .header(name, value) to simulate the same request.
    The other difference is that curl seems to be using HTTP/2 while Jsoup uses HTTP/1.1, but that shouldn't be the cause. To make sure, try running curl with the --http1.1 switch.
    I can't test any of the above because your cookies don't work for me, so you will have to experiment yourself.
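One way to check whether the headers or the HTTP version matter, independent of Jsoup, is to replay curl's request from plain Java. The sketch below is only an illustration under assumptions: it uses the JDK's built-in java.net.http client (Java 11+) instead of Jsoup, forces HTTP/1.1 as Jsoup would use, and copies the header values from the curl -v output above. The class name CurlLikeLogin, its helper methods and the placeholder arguments are hypothetical. It posts to https://bs.to as in the curl command, whereas the Jsoup snippet in the question uses http://bs.to.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class CurlLikeLogin {

    // Header values copied from curl's -v output. Host and Content-Length
    // are left out because the client fills them in itself.
    static Map<String, String> curlHeaders(String cookieUid) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "curl/7.60.0");
        headers.put("Accept", "*/*");
        headers.put("Cookie", "__bsduid=" + cookieUid);
        headers.put("Content-Type", "application/x-www-form-urlencoded");
        return headers;
    }

    // Builds the form body the way curl -d sends it: field names are kept
    // literal (brackets included), only the values are URL-encoded.
    static String formBody(String user, String pass, String token) {
        return "login[user]=" + URLEncoder.encode(user, StandardCharsets.UTF_8)
             + "&login[pass]=" + URLEncoder.encode(pass, StandardCharsets.UTF_8)
             + "&security_token=" + URLEncoder.encode(token, StandardCharsets.UTF_8);
    }

    // Assembles the POST request without sending it.
    static HttpRequest buildRequest(String cookieUid, String user,
                                    String pass, String token) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create("https://bs.to"))
            .version(HttpClient.Version.HTTP_1_1) // same protocol version Jsoup would use
            .POST(HttpRequest.BodyPublishers.ofString(formBody(user, pass, token)));
        curlHeaders(cookieUid).forEach(builder::header);
        return builder.build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder arguments: cookie value, user name, password, security token.
        HttpRequest request = buildRequest(args[0], args[1], args[2], args[3]);
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

If the login succeeds this way, the same header names and values can be applied to the Jsoup connection with .header(name, value) before calling .execute(). If it still fails over HTTP/1.1 with identical headers, the protocol version or something else in the request is worth a closer look.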