I'm trying to parse this site:
https://www.monster.com/jobs/search/?q=java&where=usa&stpage=1
In essence, it's not complicated: it's a single-page application. You give it keywords and click search, and it displays the results. It starts by displaying only around 29 results; as you scroll down, new results get loaded.
Before loading new results, it sends a GET request to
https://www.monster.com/jobs/search/pagination/?q=java&where=usa&isDynamicPage=true&isMKPagination=true&page=2&total=26
which returns a JSON reply: a list of jobs whose entries look somewhat like this:
{"Title":"Java Developer","TitleLink":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","DatePostedText":"6 days ago","DatePosted":"2020-01-18T12:00","LocationText":"Orlando, FL, 32801","JobViewUrl":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","ImpressionTracking":"data-m_impr_uuid=\"a7320356-70db-46ca-908e-e540f0e74cec\" data-m_impr_a_placement_id=\"JSR2CW\" data-m_impr_s_t=\"t\" data-m_impr_j_p=\"27\" data-m_impr_j_jpm=\"1\" data-m_impr_j_lat=\"28.5418\" data-m_impr_j_long=\"-81.3736\" data-m_impr_j_jawsid=\"418397617\" data-m_impr_j_postingid=\"b55f4409-3858-483a-a2e9-65e254ec1cd2\" data-m_impr_j_jobid=\"215193478\" data-m_impr_j_cid=\"660\" data-m_impr_j_occid=\"11970\" data-m_impr_j_lid=\"385\" data-m_impr_j_jpt=\"1\" data-m_impr_j_pvc=\"monster\" data-m_impr_j_coc=\"xsummittechx\" ","Company":{"Name":"Summit Technologies","HasCompanyAddress":true,"LogoLink":""},"Text":"Java Developer","ApplyType":"ApplyOnline","IsAggregated":"false","JobViewUrlMeta":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","MusangKingId":"215193478","CompanyLogoUrl":"","PrivateBoardIconImageUrl":"","FitIcon":"","FitIconType":""}
In addition, a POST request is sent to
https://ib.adnxs.com/ut/v3
(the v3 request), where the value 14162549 of tag_id seems to be taken from the above GET request.
So as you scroll down, it sends one GET and one POST request for each new batch of results, until at some point the scrolling ends and so do the requests.
I don't understand how it determines when to stop.
I want to scrape those jobs, and I can do something like sending GETs to
https://www.monster.com/jobs/search/pagination/?q=java&where=usa&isDynamicPage=true&isMKPagination=true&page=N
but I wouldn't know when to stop. If, say, it stops scrolling at &page=12 and I send a request for &page=13, it doesn't return an empty JSON; on the contrary, it returns some other jobs (maybe less relevant, and therefore not shown when scrolling to the bottom).
I use OkHttp to send requests, like this:
// Build the request URL; getUrl() supplies the base URL.
HttpUrl.Builder urlBuilder = HttpUrl.parse(getUrl()).newBuilder();
urlBuilder.addQueryParameter("page", "1");
String url = urlBuilder.build().toString();

Request request = new Request.Builder()
        .url(url)
        .addHeader("Content-Type", "application/json; charset=utf-8")
        .addHeader("Accept-Language", Locale.US.getLanguage())
        .build();

// Execute the GET synchronously and read the whole body as a string.
OkHttpClient client = new OkHttpClient();
Call call = client.newCall(request);
Response response = call.execute();
String responseBody = response.body().string();
System.out.println(responseBody);

// Deserialize the JSON array into a list of job objects.
Gson gson = new Gson();
List<MonsterJobJson> resultMonster = gson.fromJson(
        responseBody, new TypeToken<List<MonsterJobJson>>() {}.getType());
Not enough reputation to just comment.
You might look at div.mux-search-results. It seems to have attributes that describe how to load more results, the number of results per page, and the total number of results to display. Some attributes that seem related are listed below:
data-results-page="1"
data-results-url="https://www.monster.com/jobs/search/pagination/?q=java&where=usa&stpage=1&isDynamicPage=true&isMKPagination=true"
data-results-per-page="25"
data-results-total="250"
data-total-search-results="61503"
data-results-max="250"
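If those attributes mean what their names suggest (an assumption on my part; I haven't found them documented), the last page to request could be derived from them. Below is a minimal sketch using only the JDK. ResultPaging, intAttribute and lastPage are hypothetical names, and the regex lookup is a stand-in for a real HTML parser.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ResultPaging {

    // Reads an integer attribute such as data-results-total="250" from raw HTML.
    // The lookbehind keeps a longer attribute name with the same suffix from matching.
    static OptionalInt intAttribute(String html, String name) {
        Pattern pattern = Pattern.compile(
                "(?<![\\w-])" + Pattern.quote(name) + "\\s*=\\s*\"(\\d{1,9})\"",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        return matcher.find()
                ? OptionalInt.of(Integer.parseInt(matcher.group(1)))
                : OptionalInt.empty();
    }

    // Assumed rule: the page shows at most min(total, max) results,
    // so the last page is that count divided by the page size, rounded up.
    static int lastPage(String html) {
        int perPage = intAttribute(html, "data-results-per-page")
                .orElseThrow(() -> new IllegalArgumentException("missing data-results-per-page"));
        int total = intAttribute(html, "data-results-total")
                .orElseThrow(() -> new IllegalArgumentException("missing data-results-total"));
        int max = intAttribute(html, "data-results-max").orElse(total);
        if (perPage <= 0) {
            throw new IllegalArgumentException("data-results-per-page must be positive");
        }
        int reachable = Math.min(total, max);
        return (reachable + perPage - 1) / perPage; // ceiling division
    }
}
```

With the values listed above (250 results, 25 per page), lastPage returns 10. Whether the site really stops scrolling at that page is something you would need to check against the requests you see in the browser.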