jsoup, screen-scraping

How to save the body content of New York Times links using jsoup


I need to scrape several news websites, such as the Washington Post, the NY Times and Yahoo Message Boards. I am using jsoup, and it works fine for some of them, such as the Washington Post. For the NY Times, however, every approach I have tried has failed. The following code only gives me "Log In - The New York Times" as the content.

String html = Jsoup.connect(urlString).maxBodySize(Integer.MAX_VALUE).timeout(600000).get().html();
doc = Jsoup.parse(html);
result = doc.title() + "\n";
result += doc.body().text();

I also tried passing cookies along with my requests, but that did not work either.

Connection.Response loginForm = Jsoup.connect("https://myaccount.nytimes.com/auth/login")
     .method(Connection.Method.GET).execute();
doc = Jsoup.connect("https://myaccount.nytimes.com/auth/login")
           .data("userid", myEmail).data("password", password)
           .cookies(loginForm.cookies())
           .post();
Map<String, String> loginCookies = loginForm.cookies();
Document doc1 =  Jsoup.connect(urlString).maxBodySize(Integer.MAX_VALUE).timeout(600000)
                      .cookies(loginCookies).get();

Can anyone suggest an approach for saving the body content of NY Times URLs?


Solution

  • If you look at the data that is actually sent during a 'normal' login in the browser, you'll see that besides the cookies, user name and password, the browser also sends fields like 'token', 'expires' and so on, which it gets from the response to the first GET request. You can see them in your browser's developer tools.

    You can get these values easily. To get the token, for example, you can use the query div[class="control hidden"] > input[name=token].
    Also consider changing the user agent of your request to match the browser you use on your PC. That way you'll get the same response from the site, with the same field names and so on.
    See also a similar question on how to log in to a website.
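
    The steps above could be sketched roughly as follows. This is only a sketch under several assumptions: the login URL and the 'userid' / 'password' field names are copied from the question, the class name, method names and user agent string are made up for illustration, and it assumes the extra fields are plain hidden inputs on the login page. The actual form may differ or may have changed, so check it in the developer tools first. Note that it also keeps the cookies returned by the POST, whereas the code in the question reuses only the cookies from the initial GET.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NytLoginSketch {
    // Login URL taken from the question; it may no longer be valid.
    static final String LOGIN_URL = "https://myaccount.nytimes.com/auth/login";
    // Example value only (assumption): use the user agent string of your own browser.
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    // Collects every named hidden input (for example 'token' and 'expires').
    // The token alone could also be selected with the query from the answer:
    // div[class="control hidden"] > input[name=token]
    static Map<String, String> hiddenFields(Document loginPage) {
        Map<String, String> fields = new HashMap<>();
        for (Element input : loginPage.select("input[type=hidden][name]")) {
            fields.put(input.attr("name"), input.attr("value"));
        }
        return fields;
    }

    static String fetchArticle(String articleUrl, String email, String password)
            throws IOException {
        // 1. GET the login page to obtain the initial cookies and the hidden fields.
        Connection.Response loginPage = Jsoup.connect(LOGIN_URL)
                .userAgent(USER_AGENT)
                .method(Connection.Method.GET)
                .execute();

        // 2. POST the credentials together with the hidden fields and the cookies.
        //    Field names 'userid' and 'password' come from the question.
        Connection.Response loginResult = Jsoup.connect(LOGIN_URL)
                .userAgent(USER_AGENT)
                .data(hiddenFields(loginPage.parse()))
                .data("userid", email)
                .data("password", password)
                .cookies(loginPage.cookies())
                .method(Connection.Method.POST)
                .execute();

        // 3. Merge the cookies; any cookie set by the POST overrides the GET value.
        Map<String, String> cookies = new HashMap<>(loginPage.cookies());
        cookies.putAll(loginResult.cookies());

        // 4. Fetch the article with the merged cookies and the same user agent.
        Document article = Jsoup.connect(articleUrl)
                .userAgent(USER_AGENT)
                .cookies(cookies)
                .maxBodySize(Integer.MAX_VALUE)
                .timeout(600000)
                .get();
        return article.title() + "\n" + article.body().text();
    }
}
```

    With this in place, a call such as NytLoginSketch.fetchArticle(urlString, myEmail, password) would return the title and body text, provided the login actually succeeds. If the title is still "Log In - The New York Times", compare the fields sent in step 2 with what the browser sends.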