Search code examples
htmlriframehttr

Extracting graph data from a specific iframe on web page in R


I want to extract the data presented in a graph on a website and automatically download it. It is the first graph after the headline: "Office Investment Prices (gross € psqm)" on: https://www.immostat.com/market-data

The data however is behind an iframe and keeps updating which updates the graph, which updates the link in the iframe. So I need to dynamically update the link.

Using the devtools in the browser I managed to find the data in a sourcre called "htmlComp-iframe (ae915e_6e3e7e9d19bc8d5af1f3e4b96ae5c686.html) and can be loaded via the link : https://www-immostat-com.filesusr.com/html/ae915e_6e3e7e9d19bc8d5af1f3e4b96ae5c686.html

I have very limited experience with HTML etc. but I gathered that the graph is loaded as a seperate web page in an iframe. So I expect to only find the link in the original HTML.

Since this link points to a static set of data and a new link is used when new data is available,I need to find a way to find the data dynamically. I however am unable to find the link in the content when I do at httr:GET and extract it with content("text").

httr::GET(
"https://www.immostat.com/market-data", 
verbose() 
) %>% 
content("text")

Am I missing something? How would I


Solution

  • From page source you can only find placeholders or anchors for dynamic elements, e.g for that specific iframe there's:

    <div id="comp-iw3d16s21" class="comp-iw3d16s21 _xg6_p"></div>
    

    All those dynamic elements are rendered by Javascript in your browser, part of this process is coordinated by the asset manager that translates id-s like comp-iw3d16s21 to url-s for that specific page version. Part of that response looks like this:

                    "comp-iw3d16s21": {
                        "url": "https://www-immostat-com.filesusr.com/html/ae915e_6e3e7e9d19bc8d5af1f3e4b96ae5c686.html",
                        "translations": {
                            "title": "Embedded Content"
                        }
                    }
    

    Technically you can make that request yourself too end extract iframe urls, though it includes 35 arguments and making it work for next page releases looks like a proper challenge:

    https://siteassets.parastorage.com/pages/pages/thunderbolt?appDefinitionIdToSiteRevision=%7B%2214bcded7-0066-7c35-14d7-466cb3f09103%22%3A%22855%22%7D&beckyExperiments=specs.thunderbolt.supportSpxInEEMappers%3Atrue%2Cspecs.thunderbolt.one_cell_grid_display_flex%3Atrue%2Cspecs.thunderbolt.MediaContainerAndPageBackgroundMapper%3Atrue%2Cspecs.thunderbolt.catharsis_theme_optimize_css%3Atrue%2Cspecs.thunderbolt.WRichTextSemanticClasses%3Atrue%2Cspecs.thunderbolt.ghostify_hidden_comps%3Atrue%2Cspecs.thunderbolt.edixIsInFirstFold%3Atrue%2Cspecs.thunderbolt.catharsis_theme%3Atrue%2Cspecs.thunderbolt.DatePickerPortal%3Atrue%2Cspecs.thunderbolt.native_css_mappers_popups%3Atrue%2Cspecs.thunderbolt.wowImageRelayout%3Atrue%2Cspecs.thunderbolt.useElementoryRelativePath%3Atrue%2Cspecs.thunderbolt.new_responsive_layout_render_all_breakpoints%3Atrue%2Cspecs.thunderbolt.mesh_css_catharsis%3Atrue%2Cspecs.thunderbolt.DDMenuMigrateCssCarmiMapper%3Atrue%2Cspecs.thunderbolt.responsiveShapeDividersPublic%3Atrue%2Cspecs.thunderbolt.compsMeasuresCss_catharsis%3Atrue%2Cspecs.thunderbolt.customElemCollapsedheight%3Atrue%2Cspecs.thunderbolt.url_hierarchy%3Atrue%2Cspecs.thunderbolt.scaleprop%3Atrue%2Cspecs.thunderbolt.interactionsOverrides%3Atrue%2Cspecs.thunderbolt.displayRefComponentsAsBlock%3Atrue%2Cspecs.thunderbolt.pinned_layout_css_catharsis%3Atrue%2CuseTranslatedUrlSlugs%3Atrue%2Cspecs.thunderbolt.responsiveLayout_optimize_css%3Atrue%2Cspecs.thunderbolt.theme_fonts_colors_catharsis%3Atrue%2Cspecs.thunderbolt.catharsis_fontFaces%3Atrue&contentType=application%2Fjson&deviceType=Desktop&dfCk=6&dfVersion=1.2684.0&disableStaticPagesUrlHierarchy=false&editorName=Unknown&experiments=bv_remove_add_chat_viewer_fixer%2Cdm_linkTargetDefaults%2Cdm_removePageDataUnderTranslations%2Cdm_runTranslationsPageUriSeoFixer&externalBaseUrl=https%3A%2F%2Fwww.immostat.com&fileId=5745cace.bundle.min&formFactor=desktop&hasTPAWorkerOnSite=false&isConsentPolicyActive=true&isHttps=true&isInSeo=false&isMultilingualEnabled=false&isPremiumDomain=true&isTrackClicksAnalyticsEnabled=false&isUrlMigrated=true&isWixCodeOnPage=false&isWixCodeOnSite=false&language=fr&languageResolutionMethod=QueryParam&metaSiteId=6d08802f-53b1-41d3-8efe-e9ff693936c5&module=thunderbolt-features&originalLanguage=fr&pageId=ae915e_8a4ec4fca009e211728288dc101e4786_1143.json&quickActionsMenuEnabled=false&registryLibrariesTopology=%5B%7B%22artifactId%22%3A%22editor-elements%22%2C%22namespace%22%3A%22wixui%22%2C%22url%22%3A%22https%3A%2F%2Fstatic.parastorage.com%2Fservices%2Feditor-elements%2F1.11361.0%22%7D%2C%7B%22artifactId%22%3A%22editor-elements%22%2C%22namespace%22%3A%22dsgnsys%22%2C%22url%22%3A%22https%3A%2F%2Fstatic.parastorage.com%2Fservices%2Feditor-elements%2F1.11361.0%22%7D%5D&remoteWidgetStructureBuilderVersion=1.238.0&siteId=64b052e8-197b-46d4-8d7f-2954e7e9a6a6&siteRevision=1144&staticHTMLComponentUrl=https%3A%2F%2Fwww-immostat-com.filesusr.com%2F&useSandboxInHTMLComp=false&viewMode=desktop
    

    I personally would lean towards headless browsers that can execute site's JavaScript and render that page as end-user would see it; evaluating custom javascript in such an environment is also super-handy. A personal preference here is {chromote} , but it's relatively similar with RSelenium / Selenium.

    I'm sure the process can be optimized, but here it goes something like this:

    • load page with chromote, site's JavaScript will be executed and all those elements accessible through your web browser element inspector are accessible;
    • find iframe with JavaScript + XPath, use a heading with predefined text as an anchor, extract iframe url;
    • load url ( includes javascript for Google Charts) and parse content with {rvest}, it's handy for extracting elements by CSS selectors and XPath;
    • grab a bit of JavaScript responsible for setting up data series for line chart and modify it to output JSON string of that series.
    • evaluate modified JS in chromote's JS runtime (we already have it up and running, so we might as well use it)
    • parse returned JSON in R, turn list into data.frame
    library(rvest)
    library(stringr)
    library(chromote)
    
    # create new Chrome session, wait until page is loaded + few more moments
    b <- ChromoteSession$new()
    {
      b$Page$navigate("https://www.immostat.com/market-data")
      b$Page$loadEventFired()
      Sys.sleep(.5)
    }
    
    # evaluate JavaScript in Chromote:
    # xpath to find iframe in relation to the h2 element with specific text content,
    # "Office Investment Prices (gross € psqm)",
    # extract src atribute from the first matched iframe
    iframe_src <- b$Runtime$evaluate('
    var xpath = \'//h2[text()="Office Investment Prices (gross € psqm)"]/../following-sibling::div/wix-iframe/div/iframe\';
       document.evaluate(
         xpath, 
         document, null, XPathResult.UNORDERED_NODE_ITERATOR_TYPE, null )
       .iterateNext()
       .getAttribute("src")')$result$value
    iframe_src
    #> [1] "https://www-immostat-com.filesusr.com/html/ae915e_6e3e7e9d19bc8d5af1f3e4b96ae5c686.html"
    

    You can stop here and ignore the rest if data series extraction from javascript is already up and running.

    # with rvest extract script element from iframe page source, we only need element containing  "var dataIDF",
    # split it by linefeed,
    # get a line from jsvsacript with line chart data,
    # modify resulting javascript so it would return google.visualization.arrayToDataTable() argument as 
    # JSON string, i.e.
    # var dataIDF = google.visualization.arrayToDataTable([['Quarter','Greater Paris Region'],['Q1 2006',4490],...)
    # becomes: 
    # JSON.stringify([['Quarter','Greater Paris Region'],['Q1 2006',4490],...)
    array_stringify_js <- read_html(iframe_src) |>
      html_element(xpath = "//script[contains(., 'var dataIDF')]") |>
      html_text() |>
      str_split("\n") |>
      unlist() |>
      str_subset("var dataIDF") |>
      str_replace("var.*arrayToDataTable", "JSON.stringify")
    str_trunc(array_stringify_js, 80)
    #> [1] "\tJSON.stringify([['Quarter','Greater Paris Region'],['Q1 2006',4490],['Q2 200..."
    
    # use existing JS runtime to evalute that frankenscript we just created,
    # parse resulting JSON with jsonlite, result is a list
    chart_data_lst <- b$Runtime$evaluate(array_stringify_js)$result$value |> 
      jsonlite::parse_json() 
    str(chart_data_lst[1:3])
    #> List of 3
    #>  $ :List of 2
    #>   ..$ : chr "Quarter"
    #>   ..$ : chr "Greater Paris Region"
    #>  $ :List of 2
    #>   ..$ : chr "Q1 2006"
    #>   ..$ : int 4490
    #>  $ :List of 2
    #>   ..$ : chr "Q2 2006"
    #>   ..$ : int 4631
    
    # returned list structure could use some tweaking, 
    # we want list of named lists and items with same names should share same types,
    # this structure can be converted to data.frame with proper column types,  
    # without having to deal with numeric values turned into strings;
    
    # use the first list item for names for all others, drop the first element, 
    # bind list of named lists to data.frame
    chart_data <- lapply(chart_data_lst[-1], setNames, chart_data_lst[[1]]) |>
      do.call(rbind, args = _) |>
      as.data.frame()
    

    Result:

    head(chart_data)
    #>   Quarter Greater Paris Region
    #> 1 Q1 2006                 4490
    #> 2 Q2 2006                 4631
    #> 3 Q3 2006                 4803
    #> 4 Q4 2006                 4837
    #> 5 Q1 2007                 4953
    #> 6 Q2 2007                 5064