html web-scraping openrefine

OpenRefine cannot fetch HTML inside an accordion


I know that OpenRefine is not a perfect tool for web scraping, but I am looking for some help with the first step.

I cannot collect the full HTML in OpenRefine when I add a column by fetching URLs (https://profiles.health.ny.gov/hospital/view/103094). The fetched HTML does not include any of the content under the accordions, such as services, bed types, etc.

Any idea how to get the full HTML by fetching in OpenRefine? I am trying to collect the information under Administrative, whose XPath is "//div[4]/div/ul/li" ("div#AdministrativeBox.in.collapse").


Solution

  • This website loads its content dynamically using JavaScript. The information you are interested in is not in the page's source code, so OpenRefine cannot extract it.

    However, there is a workaround. If you transform your URLs with the GREL formula value.replace('view', 'tab_overview'), you get pages that can be scraped, such as https://profiles.health.ny.gov/hospital/tab_overview/103094.

    Note that OpenRefine does not use XPath but jsoup selectors. To get the elements of the "Administrative" block, you can use this GREL formula:

    forEach(value.parseHtml().select('#AdministrativeBox li'), e, e.htmlText()).join(',')
    

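If you want to check the same two steps outside OpenRefine, here is a rough Python sketch using only the standard library. It mirrors the URL rewrite and then collects the text of each li inside the element with id AdministrativeBox. The helper names (to_overview_url, BoxItemParser, administrative_items) and the sample HTML are made up for illustration; the real page's markup may differ, and the parser assumes the li elements are not nested.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


def to_overview_url(url: str) -> str:
    """Mirror the GREL step value.replace('view', 'tab_overview')."""
    return url.replace("view", "tab_overview")


class BoxItemParser(HTMLParser):
    """Collect the text of every <li> inside the element with the given id."""

    def __init__(self, box_id: str):
        super().__init__()
        self.box_id = box_id
        self.depth = 0      # 0 means we are outside the target element
        self.in_li = False  # assumes <li> elements are not nested
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag == "li":
                self.in_li = True
                self.items.append("")
        elif dict(attrs).get("id") == self.box_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


def administrative_items(html: str, box_id: str = "AdministrativeBox") -> list:
    """Rough equivalent of select('#AdministrativeBox li') plus htmlText()."""
    parser = BoxItemParser(box_id)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so each item is a single clean line.
    return [" ".join(item.split()) for item in parser.items]


# Hypothetical sample markup, NOT copied from the real site.
SAMPLE_HTML = """
<div id="OtherBox"><ul><li>ignored</li></ul></div>
<div id="AdministrativeBox" class="in collapse">
  <ul>
    <li>Operator: <b>Example Hospital</b></li>
    <li>Type:<br> Placeholder</li>
  </ul>
</div>
<ul><li>also ignored</li></ul>
"""

if __name__ == "__main__":
    print(to_overview_url("https://profiles.health.ny.gov/hospital/view/103094"))
    print(",".join(administrative_items(SAMPLE_HTML)))
```

In practice you would fetch the rewritten URL first and pass the response body to administrative_items; joining the items with ',' corresponds to the .join(',') at the end of the GREL formula.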