Tags: javascript, java, web-scraping, xmlhttprequest, okhttp

HTTP parser: scraping single page application: many GETs, how to find out when the page ends


I'm trying to parse this site:

https://www.monster.com/jobs/search/?q=java&where=usa&stpage=1

In essence, it's not complicated: it's a single-page application. You give it keywords, click search, and it displays the results, starting with only around 29. As you scroll down, more results get loaded.

Before loading new results, it sends a GET request to

https://www.monster.com/jobs/search/pagination/?q=java&where=usa&isDynamicPage=true&isMKPagination=true&page=2&total=26

which returns a JSON reply: a list of jobs, each of which looks somewhat like this:

{"Title":"Java Developer","TitleLink":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","DatePostedText":"6 days ago","DatePosted":"2020-01-18T12:00","LocationText":"Orlando, FL, 32801","JobViewUrl":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","ImpressionTracking":"data-m_impr_uuid=\"a7320356-70db-46ca-908e-e540f0e74cec\" data-m_impr_a_placement_id=\"JSR2CW\" data-m_impr_s_t=\"t\" data-m_impr_j_p=\"27\" data-m_impr_j_jpm=\"1\" data-m_impr_j_lat=\"28.5418\" data-m_impr_j_long=\"-81.3736\" data-m_impr_j_jawsid=\"418397617\" data-m_impr_j_postingid=\"b55f4409-3858-483a-a2e9-65e254ec1cd2\" data-m_impr_j_jobid=\"215193478\" data-m_impr_j_cid=\"660\" data-m_impr_j_occid=\"11970\" data-m_impr_j_lid=\"385\" data-m_impr_j_jpt=\"1\" data-m_impr_j_pvc=\"monster\" data-m_impr_j_coc=\"xsummittechx\" ","Company":{"Name":"Summit Technologies","HasCompanyAddress":true,"LogoLink":""},"Text":"Java Developer","ApplyType":"ApplyOnline","IsAggregated":"false","JobViewUrlMeta":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","MusangKingId":"215193478","CompanyLogoUrl":"","PrivateBoardIconImageUrl":"","FitIcon":"","FitIconType":""}


A POST request is also sent to

https://ib.adnxs.com/ut/v3

In this request (v3), the value of tag_id, 14162549, seems to be taken from the above GET request.

So as you scroll down, it sends one GET and one POST request each time, until it doesn't: the scroll ends and so do the requests.

I don't understand how it determines when to stop.

I want to scrape those jobs, and I can do something like sending GETs to

https://www.monster.com/jobs/search/pagination/?q=java&where=usa&isDynamicPage=true&isMKPagination=true&page=N

but I wouldn't know when to stop. If, say, scrolling stops at &page=12 and I send a request for &page=13, it doesn't return empty JSON. On the contrary, it returns some other jobs (maybe less relevant, and therefore not shown when scrolling to the bottom).

I use OkHttp to send requests, like this:

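// Imports assume OkHttp 3+ (okhttp3.*) and Gson.
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import okhttp3.HttpUrl;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import java.util.List;
import java.util.Locale;

// Add the page number to the base URL returned by getUrl().
HttpUrl.Builder urlBuilder = HttpUrl.parse(getUrl()).newBuilder();
urlBuilder.addQueryParameter("page", "1");
String url = urlBuilder.build().toString();

Request request = new Request.Builder()
        .url(url)
        .addHeader("Content-Type", "application/json; charset=utf-8")
        .addHeader("Accept-Language", Locale.US.getLanguage())
        .build();

OkHttpClient client = new OkHttpClient();
String responseBody;
// try-with-resources closes the response so the connection is released.
try (Response response = client.newCall(request).execute()) {
    responseBody = response.body().string();
}
System.out.println(responseBody);

// Deserialize the JSON array into a list of job objects.
Gson gson = new Gson();
List<MonsterJobJson> resultMonster = gson.fromJson(
        responseBody, new TypeToken<List<MonsterJobJson>>() {}.getType());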
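
The MonsterJobJson class is not shown in the question. A minimal sketch of what it might look like follows, assuming Gson's default behavior of matching JSON keys to Java field names exactly (hence the capitalized fields) and ignoring JSON keys that have no matching field. Only a subset of the keys from the sample reply is mapped; the nested CompanyJson class name and the summary() helper are hypothetical, added for illustration.

```java
// Hypothetical mapping class for one entry of the pagination reply.
// Field names mirror the JSON keys so that Gson's default naming policy binds them.
class MonsterJobJson {
    String Title;
    String TitleLink;
    String DatePostedText;
    String DatePosted;
    String LocationText;
    String JobViewUrl;
    CompanyJson Company;
    String MusangKingId;

    // Nested object under the "Company" key.
    static class CompanyJson {
        String Name;
        boolean HasCompanyAddress;
        String LogoLink;
    }

    // Illustrative helper (not part of the reply): a one-line description of the job.
    String summary() {
        String companyName = (Company == null) ? "unknown company" : Company.Name;
        return Title + " at " + companyName + ", " + LocationText;
    }
}
```

MusangKingId looks like a per-job identifier in the sample reply, so it may be a reasonable key for de-duplicating jobs across pages, though that is an assumption.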

Solution

  • Not enough reputation to just comment.

    You might look at div.mux-search-results. It seems to have attributes that describe how to load more results, how many results to display per page, and how many in total. Some attributes that seem related are listed below:

    • data-results-page="1"
    • data-results-url="https://www.monster.com/jobs/search/pagination/?q=java&where=usa&stpage=1&isDynamicPage=true&isMKPagination=true"
    • data-results-per-page="25"
    • data-results-total="250"
    • data-total-search-results="61503"
    • data-results-max="250"
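
If those attributes really do bound the scrolling (an assumption, not confirmed against the site), the last page can be computed as data-results-total divided by data-results-per-page, rounded up. The sketch below uses only the JDK: it pulls the two numbers out of the raw HTML with a regular expression and builds the page URLs. The class and method names are hypothetical, page numbering is assumed to start at 1, and a real HTML parser would be more robust than a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MonsterPaging {

    // Reads an integer attribute such as data-results-total="250" from raw HTML.
    // Returns -1 when the attribute is missing.
    static int readIntAttribute(String html, String attributeName) {
        Pattern pattern = Pattern.compile("\\s" + Pattern.quote(attributeName) + "=\"(\\d+)\"");
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : -1;
    }

    // Pages needed to show `total` results at `perPage` results per page, rounded up.
    static int pageCount(int total, int perPage) {
        if (total <= 0 || perPage <= 0) {
            return 0;
        }
        return (total + perPage - 1) / perPage;
    }

    // Builds one URL per page by appending &page=N to the base pagination URL.
    static List<String> pageUrls(String baseUrl, int pages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            urls.add(baseUrl + "&page=" + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup built from the attribute values listed above.
        String html = "<div class=\"mux-search-results\" data-results-page=\"1\" "
                + "data-results-per-page=\"25\" data-results-total=\"250\" "
                + "data-total-search-results=\"61503\" data-results-max=\"250\"></div>";
        String baseUrl = "https://www.monster.com/jobs/search/pagination/"
                + "?q=java&where=usa&isDynamicPage=true&isMKPagination=true";

        int total = readIntAttribute(html, "data-results-total");
        int perPage = readIntAttribute(html, "data-results-per-page");
        int pages = pageCount(total, perPage);

        System.out.println("pages = " + pages); // prints "pages = 10"
        for (String url : pageUrls(baseUrl, pages)) {
            System.out.println(url);
        }
    }
}
```

With the values above (250 total, 25 per page) this gives 10 pages, so requests past page=10 would be skipped even though the server still answers them.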