Tags: ruby-on-rails, ruby, web-scraping, nokogiri, open-uri

Nokogiri results different from browser inspect


I am trying to scrape a site, but the links Nokogiri returns are different from what I see when I inspect the page in the browser.

In my browser I get normal links, but with Nokogiri every a href becomes javascript:void(0);.

Here is the site:

https://www.ctgoodjobs.hk/jobs/part-time

Here is my code:

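require 'nokogiri'
require 'open-uri'

url = "https://www.ctgoodjobs.hk/jobs/part-time"
# `next` implies this snippet sits inside a loop over pages.
response = URI.open(url) rescue nil
next unless response
doc = Nokogiri::HTML(response)   # parse the response already fetched
links = doc.search('.job-title > a').map { |a| a['href'] }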

Solution

  • It is not that easy: the URLs are "obscured" by a JavaScript function, which is why you get javascript:void(0); when you ask for the hrefs. Looking at the HTML, there are hidden inputs for each link, and there is a preview URL you can use to build each job's preview link (if that's what you're looking for). The markup for one job looks like this:

    <div class="result-list-job current-view">
      <input type="hidden" name="job_id" value="04375145">
      <input type="hidden" name="each_job_title_url" value="barista-senior-barista-咖啡調配員">
      <h2 class="job-title"><a href="javascript:void(0);">Barista/ Senior Barista 咖 啡 調 配 員</a></h2>
      <h3 class="job-company"><a href="/company-jobs/pacific-coffee-company/00028652" target="_blank">PACIFIC COFFEE CO. LTD.</a></h3>
      <div class="job-description">
        <ul class="job-desc-list clearfix">
          <li class="job-desc-loc job-desc-small-icon">-</li>
          <li class="job-desc-work-exp">0-1 yr(s)</li>
          <li class="job-desc-salary job-desc-small-icon">-</li>
          <li class="job-desc-post-date">09/11/16</li>
        </ul>
      </div>
      <a class="job-save-btn" title="save this job" style="display: inline;"> </a>
      <div class="job-batch-apply"><span class="checkbox" style="background-position: 0px 0px;"></span><input type="checkbox" class="styled" name="job_checkbox" value="04375145"></div>
      <div class="job-cat job-cat-de"></div>
    </div>
    

    Then you can retrieve each job_id from those inputs, like this:

     inputs = doc.search('//input[@name="job_id"]')
    

    and then build the URLs (I found the base URL in joblist_preview.js):

     urls = inputs.map do |input|
       "https://www.ctgoodjobs.hk/english/jobdetails/details.asp?m_jobid=#{input['value']}&joblistmode=previewlist&ga_channel=ct"
     end