Extracting graph data from a specific iframe on web page in R

I want to extract the data presented in a graph on a website and automatically download it. It is the first graph after the headline: "Office Investment Prices (gross € psqm)" on: https://www.immostat.com/market-data

The data however is behind an iframe and keeps updating which updates the graph, which updates the link in the iframe. So I need to dynamically update the link.

Using the devtools in the browser I managed to find the data in a sourcre called "htmlComp-iframe (ae915e_6e3e7e9d19bc8d5af1f3e4b96ae5c686.html) and can be loaded via the link : https://www-immostat-com.filesusr.com/html/ae915e_6e3e7e9d19bc8d5af1f3e4b96ae5c686.html

I have very limited experience with HTML etc. but I gathered that the graph is loaded as a seperate web page in an iframe. So I expect to only find the link in the original HTML.

Since this link points to a static set of data and a new link is used when new data is available,I need to find a way to find the data dynamically. I however am unable to find the link in the content when I do at httr:GET and extract it with content("text").

httr::GET(
"https://www.immostat.com/market-data", 
verbose() 
) %>% 
content("text")

Am I missing something? How would I

Solution

From page source you can only find placeholders or anchors for dynamic elements, e.g for that specific iframe there's:

<div id="comp-iw3d16s21" class="comp-iw3d16s21 _xg6_p"></div>

All those dynamic elements are rendered by Javascript in your browser, part of this process is coordinated by the asset manager that translates id-s like comp-iw3d16s21 to url-s for that specific page version. Part of that response looks like this:

                "comp-iw3d16s21": {
                    "url": "https://www-immostat-com.filesusr.com/html/ae915e_6e3e7e9d19bc8d5af1f3e4b96ae5c686.html",
                    "translations": {
                        "title": "Embedded Content"
                    }
                }

Technically you can make that request yourself too end extract iframe urls, though it includes 35 arguments and making it work for next page releases looks like a proper challenge:

https://siteassets.parastorage.com/pages/pages/thunderbolt?appDefinitionIdToSiteRevision=%7B%2214bcded7-0066-7c35-14d7-466cb3f09103%22%3A%22855%22%7D&beckyExperiments=specs.thunderbolt.supportSpxInEEMappers%3Atrue%2Cspecs.thunderbolt.one_cell_grid_display_flex%3Atrue%2Cspecs.thunderbolt.MediaContainerAndPageBackgroundMapper%3Atrue%2Cspecs.thunderbolt.catharsis_theme_optimize_css%3Atrue%2Cspecs.thunderbolt.WRichTextSemanticClasses%3Atrue%2Cspecs.thunderbolt.ghostify_hidden_comps%3Atrue%2Cspecs.thunderbolt.edixIsInFirstFold%3Atrue%2Cspecs.thunderbolt.catharsis_theme%3Atrue%2Cspecs.thunderbolt.DatePickerPortal%3Atrue%2Cspecs.thunderbolt.native_css_mappers_popups%3Atrue%2Cspecs.thunderbolt.wowImageRelayout%3Atrue%2Cspecs.thunderbolt.useElementoryRelativePath%3Atrue%2Cspecs.thunderbolt.new_responsive_layout_render_all_breakpoints%3Atrue%2Cspecs.thunderbolt.mesh_css_catharsis%3Atrue%2Cspecs.thunderbolt.DDMenuMigrateCssCarmiMapper%3Atrue%2Cspecs.thunderbolt.responsiveShapeDividersPublic%3Atrue%2Cspecs.thunderbolt.compsMeasuresCss_catharsis%3Atrue%2Cspecs.thunderbolt.customElemCollapsedheight%3Atrue%2Cspecs.thunderbolt.url_hierarchy%3Atrue%2Cspecs.thunderbolt.scaleprop%3Atrue%2Cspecs.thunderbolt.interactionsOverrides%3Atrue%2Cspecs.thunderbolt.displayRefComponentsAsBlock%3Atrue%2Cspecs.thunderbolt.pinned_layout_css_catharsis%3Atrue%2CuseTranslatedUrlSlugs%3Atrue%2Cspecs.thunderbolt.responsiveLayout_optimize_css%3Atrue%2Cspecs.thunderbolt.theme_fonts_colors_catharsis%3Atrue%2Cspecs.thunderbolt.catharsis_fontFaces%3Atrue&contentType=application%2Fjson&deviceType=Desktop&dfCk=6&dfVersion=1.2684.0&disableStaticPagesUrlHierarchy=false&editorName=Unknown&experiments=bv_remove_add_chat_viewer_fixer%2Cdm_linkTargetDefaults%2Cdm_removePageDataUnderTranslations%2Cdm_runTranslationsPageUriSeoFixer&externalBaseUrl=https%3A%2F%2Fwww.immostat.com&fileId=5745cace.bundle.min&formFactor=desktop&hasTPAWorkerOnSite=false&isConsentPolicyActive=true&isHttps=true&isInSeo=false&isMultilingualEnabled=false&isPremiumDomain=true&isTrackClicksAnalyticsEnabled=false&isUrlMigrated=true&isWixCodeOnPage=false&isWixCodeOnSite=false&language=fr&languageResolutionMethod=QueryParam&metaSiteId=6d08802f-53b1-41d3-8efe-e9ff693936c5&module=thunderbolt-features&originalLanguage=fr&pageId=ae915e_8a4ec4fca009e211728288dc101e4786_1143.json&quickActionsMenuEnabled=false&registryLibrariesTopology=%5B%7B%22artifactId%22%3A%22editor-elements%22%2C%22namespace%22%3A%22wixui%22%2C%22url%22%3A%22https%3A%2F%2Fstatic.parastorage.com%2Fservices%2Feditor-elements%2F1.11361.0%22%7D%2C%7B%22artifactId%22%3A%22editor-elements%22%2C%22namespace%22%3A%22dsgnsys%22%2C%22url%22%3A%22https%3A%2F%2Fstatic.parastorage.com%2Fservices%2Feditor-elements%2F1.11361.0%22%7D%5D&remoteWidgetStructureBuilderVersion=1.238.0&siteId=64b052e8-197b-46d4-8d7f-2954e7e9a6a6&siteRevision=1144&staticHTMLComponentUrl=https%3A%2F%2Fwww-immostat-com.filesusr.com%2F&useSandboxInHTMLComp=false&viewMode=desktop

I personally would lean towards headless browsers that can execute site's JavaScript and render that page as end-user would see it; evaluating custom javascript in such an environment is also super-handy. A personal preference here is {chromote} , but it's relatively similar with RSelenium / Selenium.

I'm sure the process can be optimized, but here it goes something like this:

load page with chromote, site's JavaScript will be executed and all those elements accessible through your web browser element inspector are accessible;
find iframe with JavaScript + XPath, use a heading with predefined text as an anchor, extract iframe url;
load url ( includes javascript for Google Charts) and parse content with {rvest}, it's handy for extracting elements by CSS selectors and XPath;
grab a bit of JavaScript responsible for setting up data series for line chart and modify it to output JSON string of that series.
evaluate modified JS in chromote's JS runtime (we already have it up and running, so we might as well use it)
parse returned JSON in R, turn list into data.frame

library(rvest)
library(stringr)
library(chromote)

# create new Chrome session, wait until page is loaded + few more moments
b <- ChromoteSession$new()
{
  b$Page$navigate("https://www.immostat.com/market-data")
  b$Page$loadEventFired()
  Sys.sleep(.5)
}

# evaluate JavaScript in Chromote:
# xpath to find iframe in relation to the h2 element with specific text content,
# "Office Investment Prices (gross € psqm)",
# extract src atribute from the first matched iframe
iframe_src <- b$Runtime$evaluate('
var xpath = \'//h2[text()="Office Investment Prices (gross € psqm)"]/../following-sibling::div/wix-iframe/div/iframe\';
   document.evaluate(
     xpath, 
     document, null, XPathResult.UNORDERED_NODE_ITERATOR_TYPE, null )
   .iterateNext()
   .getAttribute("src")')$result$value
iframe_src
#> [1] "https://www-immostat-com.filesusr.com/html/ae915e_6e3e7e9d19bc8d5af1f3e4b96ae5c686.html"

You can stop here and ignore the rest if data series extraction from javascript is already up and running.

# with rvest extract script element from iframe page source, we only need element containing  "var dataIDF",
# split it by linefeed,
# get a line from jsvsacript with line chart data,
# modify resulting javascript so it would return google.visualization.arrayToDataTable() argument as 
# JSON string, i.e.
# var dataIDF = google.visualization.arrayToDataTable([['Quarter','Greater Paris Region'],['Q1 2006',4490],...)
# becomes: 
# JSON.stringify([['Quarter','Greater Paris Region'],['Q1 2006',4490],...)
array_stringify_js <- read_html(iframe_src) |>
  html_element(xpath = "//script[contains(., 'var dataIDF')]") |>
  html_text() |>
  str_split("\n") |>
  unlist() |>
  str_subset("var dataIDF") |>
  str_replace("var.*arrayToDataTable", "JSON.stringify")
str_trunc(array_stringify_js, 80)
#> [1] "\tJSON.stringify([['Quarter','Greater Paris Region'],['Q1 2006',4490],['Q2 200..."

# use existing JS runtime to evalute that frankenscript we just created,
# parse resulting JSON with jsonlite, result is a list
chart_data_lst <- b$Runtime$evaluate(array_stringify_js)$result$value |> 
  jsonlite::parse_json() 
str(chart_data_lst[1:3])
#> List of 3
#>  $ :List of 2
#>   ..$ : chr "Quarter"
#>   ..$ : chr "Greater Paris Region"
#>  $ :List of 2
#>   ..$ : chr "Q1 2006"
#>   ..$ : int 4490
#>  $ :List of 2
#>   ..$ : chr "Q2 2006"
#>   ..$ : int 4631

# returned list structure could use some tweaking, 
# we want list of named lists and items with same names should share same types,
# this structure can be converted to data.frame with proper column types,  
# without having to deal with numeric values turned into strings;

# use the first list item for names for all others, drop the first element, 
# bind list of named lists to data.frame
chart_data <- lapply(chart_data_lst[-1], setNames, chart_data_lst[[1]]) |>
  do.call(rbind, args = _) |>
  as.data.frame()

Result:

head(chart_data)
#>   Quarter Greater Paris Region
#> 1 Q1 2006                 4490
#> 2 Q2 2006                 4631
#> 3 Q3 2006                 4803
#> 4 Q4 2006                 4837
#> 5 Q1 2007                 4953
#> 6 Q2 2007                 5064