python r selenium-webdriver web-scraping rselenium

Simulate scrolling in RSelenium or Selenium in Python


I am trying to scrape this website. You need to click on the magnifying glass icon in the search bar to see the records I want to extract. The issue is that the website is dynamic, and I need to scroll multiple times to get the whole page loaded before I can extract the content with rvest or BeautifulSoup. However, none of the scrolling methods from other threads have worked for me so far.

I would appreciate a solution in R or Python, using any package or library.

I tried

remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")

where remDr is the page after clicking on the magnifying glass icon.

I also tried defining the search results element: I inspected the page and extracted the XPath that leads to the list of items

search_results <- remDr$findElement( using = 'xpath', '//*[@id="search-feature-container"]/div[2]/div[2]/div[3]/div[2]/div[1]' )

Then I ran this line, but nothing scrolled at all:

search_results$sendKeysToElement(list(key = "down"))


Solution

  • You're asking for R or Python solutions, so here is one in Python. The records are loaded into the page dynamically via XHR calls, which you can see in the Network tab of the browser's Dev Tools, so you can request them directly instead of scrolling.

    Here is one way to get all that research data:

    import pandas as pd
    import requests

    # Show every column and the full cell contents when printing
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)

    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
    }

    s = requests.Session()
    s.headers.update(headers)

    frames = []
    # Request 1000 records per call ($top=1000) and page through them with $skip
    for x in range(0, 8000, 1000):
        r = s.get(f'https://vivli-prod-cus-srch.search.windows.net/indexes/studies/docs?api-key=C8237BFE70B9CC48489DC7DD84D88379&api-version=2016-09-01&$top=1000&$skip={x}&search=*&$filter=assignedAppType%20eq%20%27Default%27&$count=true&facet=studyDesign&facet=locationsOfStudySites,count:300,sort:value&facet=sponsorType&facet=contributorType&facet=sponsorName,count:500,sort:value&facet=studyType&facet=actualEnrollment,interval:100')
        r.raise_for_status()  # stop early on an HTTP error
        frames.append(pd.json_normalize(r.json()['value']))
    # Combine all pages into one dataframe
    big_df = pd.concat(frames, ignore_index=True)
    print(big_df)
    

    Result in terminal (first two rows only; the dataframe holds 7K+ records):

        @search.score   id  title   sponsorProtocolId   orgId   orgCode orgName irpOrgName  sponsorName overrideDisplayDefaults nctId   secondaryIds    acronym participantTermCodes    participantTerms    interventionTermCodes   interventionTerms   outcomeTermCodes    outcomeTerms    searchParticipantTermCodes  searchOutcomeTermCodes  searchInterventionTermCodes actualEnrollment    locationsOfStudySites   studyType   studyDesign principalInvestigator   studyStartDate  studyEndDate    sponsorType contributorType studyDoi    phase   conditions  interventionNames   outcomeNames    extractedConditions extractedInterventions  antimicrobials  groupingsOfResistancePatterns   organisms   specimenSources sampleTimes countries   regions yearsDataCollected  containsPediatrics  containsGenotype    assignedAppType numberOfIsolates    program lastUpdatedDate
    0   1.0 abd778c4-21ed-4063-9e34-e3e7b177db18    A Randomized, Double-Blind, Parallel-Group, Dose-Response Study to Evaluate the Efficacy and Safety of Two Doses of Topiramate Compared to Placebo and Propranolol in the Prophylaxis of Migraine   CR003205    d1bd067d-3e2d-43b5-80f1-6235e85c2876    JNJ Johnson & Johnson   Yoda Project    Johnson & Johnson Pharmaceutical Research & Development, L.L.C. N   NCT00236561 []      [lr5qxyw6ww35, kk05h7rpym8w, kk05h7rpym8x, kk05h7rpym8y, kk05h7rpym8z, kk05h7rpym90, kk05h7rpym91, r4hp3896n2zy]    [Male and Female, Child 6-12 years, Adolescent 13-18 years, Young Adult 19-24 years, Adult 19-44 years, Middle Aged 45-64 years, Aged 65-79 years, Migraine]    [kn3ptfq7c6lz, r4hp0qywwn28, 11g43clqdpk96, r4hp0r5sbtj7, q25gz0m8n54j, r4hp0r2dwmn5]   [Pharmacological, Topiramate, Oral, Propranolol, No active treatment, Placebo]  [q25g9q497cwj, r4hp3896n2zy, r4hp5zkjq0c3, ZxM7N2m9kOhRe2]  [Physiological or clinical, Migraine, Evaluating Response To Treatment, Assessment Of Quality Of Life]  [lr5qxyw6ww35, kk05h7rpym8w, pwhpjmwdbgkh, kk05h7rpym8x, kk05h7rpym8y, pwhpjmwdbgkg, kk05h7rpym8z, kk05h7rpym90, kk05h7rpym91, pwhpjmwdbgkf, r4hp3896n2zy, r4hp3p8ymhbg, r4hp38gs74r1, r4hp3885vk99, r4hp38mgkgb9, r4hp39w4k8tw, r4hp38c875ch, r4hp38mgkgj7, r4hp38xpp96f, r4hp3853gyf1, r4hp38l4pbqh, r4hp39krwnf7, r4hp38qpgvxq, r4hp387wrzbr, r4hp38mrn1cp, r4hp39tp4ckr, r4hp38819rxs, r4hp39mjd4qj, r4hp39cb1vjv]  [q25g9q497cwj, r4hp3896n2zy, r4hp3p8ymhbg, r4hp38gs74r1, r4hp3885vk99, r4hp38mgkgb9, r4hp39w4k8tw, r4hp38c875ch, r4hp38mgkgj7, r4hp38xpp96f, r4hp3853gyf1, r4hp38l4pbqh, r4hp39krwnf7, r4hp38qpgvxq, r4hp387wrzbr, r4hp38mrn1cp, r4hp39tp4ckr, r4hp38819rxs, r4hp39mjd4qj, r4hp39cb1vjv, r4hp5zkjq0c3, r4hp5zjccp22, r4hp5zjccp1z, r4hp5zm4npzj, r4hp5zhs6j1c, zPNWxozYM3fxBr, r4hp5zjng89p, r4hp5yw4mj85, ZxM7N2m9kOhRe2, 3BgZRR0YwkHzkP]  [kn3ptfq7c6lz, r4hp0qywwn28, r4hp13n1ty7w, r4hp13rf9486, 11g43clqdpk96, r4hp0r5sbtj7, zrcts8tmxp0g, r4hp13n1ty7r, r4hp13mrrc91, 
r4hp13mrrc8c, r4hp13mgns4j, r4hp13mrrc83, q25gz0m8n54j, r4hp0r2dwmn5, PXmmxKGR3ocNEg]   786 []  Interventional  ParallelGroup       2001-04-01T00:00:00Z    2002-12-31T00:00:00Z    Industry    Unassigned  https://doi.org/10.25934/00004657   Phase3  [Migraine]  [Topiramate, Propranolol, Placebo]  [Migraine, Evaluating Response To Treatment, Migraine, Assessment Of Quality Of Life]   [Migraine, Common Migraine, Classic Migraine, Headache] [topiramate, propranolol]   []  []  []  []  []  []  []  []  None    None    Default 0       
    1   1.0 48c15b9e-76d7-45cc-a044-6c253da74ac1    A Phase 3, Randomized, Open-label, Parallel-group, Multicenter Trial to Evaluate the Safety and Efficacy of Infliximab (REMICADE®) in Pediatric Subjects With Moderately to Severely Active Ulcerative Colitis  CR012388    d1bd067d-3e2d-43b5-80f1-6235e85c2876    JNJ Johnson & Johnson   Yoda Project    Centocor, Inc.  N   NCT00336492 [C0168T72]      [lr5qxyw6ww35, kk05h7rpym8v, kk05h7rpym8w, kk05h7rpym8x, r4hp3q5y2klm]  [Male and Female, Child, Preschool 2-5 years, Child 6-12 years, Adolescent 13-18 years, Acute Ulcerative Colitis]   [kn3ptfq7c6lz, r4hp13l4sngc, 11g43clqdpk72] [Pharmacological, Infliximab, Intravenous]  [q25g9q497cwj, r4hp5zkjq0c3, r4hp5zfl2n7g]  [Physiological or clinical, Evaluating Response To Treatment, Activity Analysis]    [lr5qxyw6ww35, kk05h7rpym8v, pwhpjmwdbgkh, kk05h7rpym8w, kk05h7rpym8x, r4hp3q5y2klm, r4hp384nvkyl, r4hp39vkd3t3, r4hp39lc1tgs, r4hp39hf705k, r4hp39kgt2jy, r4hp38mgkgb9, r4hp39w4k8tw, r4hp38c875ch, r4hp38mgkgj7, r4hp38jd5vlp, r4hp38gxry6k, r4hp38bczf6g, r4hp38yky17z, r4hp38z9n01d, r4hp39qlrrnf, r4hp381fy5cs, r4hp381fy5cw, r4hp393pwqm9, r4hp39mjd4qj, r4hp3b0d86ss, r4hp39znk89c, r4hp39b989y6, r4hp38mb0nx2, r4hp39ys9j31, r4hp39ln4dll, r4hp39krwnf7, r4hp39l6j13q, r4hp38hhy39p, r4hp381fy5cl, r4hp38jtt70k, r4hp38f9t7jr, r4hp39zj0gsm, r4hp38nsfl6q, r4hp38n1qmfs, r4hp39ln4dhj, r4hp39j9gr79, r4hp38jp8fh0, r4hp38y8vgc2, r4hp39v3rr24, r4hp3b0twljt, r4hp38819rv0, r4hp3pdb2p7r, r4hp39hf702g, eM3W2jDdq4CnoM]  [q25g9q497cwj, r4hp5zkjq0c3, r4hp5zjccp22, r4hp5zjccp1z, r4hp5zm4npzj, r4hp5zhs6j1c, zPNWxozYM3fxBr, r4hp5zjng89p, r4hp5yw4mj85, r4hp5zfl2n7g, r4hp5yxm1fj5, r4hp5yq9rf4h, r4hp5z5crc2v, r4hp5zbhq1cb, r4hp5z0tyv1k, r4hp5yvdxkr1, r4hp5zjccp2h, r4hp5zkzbcvf]  [kn3ptfq7c6lz, r4hp13l4sngc, r4hp13nhg9tp, YgJdXZMgAyT4za, r4hp13n1ty7z, r4hp13qznrsn, 3r0XoawY07FG2Z, 11g43clqdpk72, PNz3A1OgQesRKw, 11g43clqdpk4z, r4hp5z5nty2h, r4hp5zj2934z, r4hp5zhs6j1c, zPNWxozYM3fxBr]  60  [United States, Canada, 
Belgium, Denmark, Netherlands]  Interventional  ParallelGroup       2006-09-01T00:00:00Z    2010-04-30T00:00:00Z    Industry    Unassigned  https://doi.org/10.25934/00004723   Phase3  [Acute Ulcerative Colitis]  [Infliximab]    [Evaluating Response To Treatment, Activity Analysis]   [Ulcerative Colitis]    [infliximab]    []  []  []  []  []  []  []  []  None    None    Default 0       
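
The loop above hardcodes an upper bound of 8000 records. Since the request already passes `$count=true`, a more robust variant could read the total from the response and stop when it has been reached. The following is only a sketch under assumptions: `fetch_all` and `get_page` are hypothetical names, and it assumes the total is reported under the `@odata.count` key of the JSON payload next to `value`. If that key is absent, it simply keeps paging until an empty page comes back.

```python
from typing import Callable, Dict, List


def fetch_all(get_page: Callable[[int], Dict], page_size: int = 1000) -> List[Dict]:
    """Collect records page by page until the reported total is reached.

    get_page(skip) is a caller-supplied function (hypothetical) that returns
    the parsed JSON payload for the page starting at offset `skip`.
    """
    records: List[Dict] = []
    skip = 0
    while True:
        payload = get_page(skip)
        batch = payload.get("value", [])
        if not batch:
            # An empty page means there is nothing left to fetch
            break
        records.extend(batch)
        skip += page_size
        # Assumption: the total record count is reported as '@odata.count'
        total = payload.get("@odata.count")
        if total is not None and skip >= total:
            break
    return records
```

With the session `s` from the answer, `get_page` could be a small function that formats the same URL with the given `$skip` value and returns `s.get(url).json()`; the collected list can then be passed to `pd.json_normalize` in one call.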
    

    Relevant documentation: pandas and requests.
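
If you still need to scroll in a real browser (for a site that does not expose its data this way), a common pattern in Selenium for Python is to scroll to the bottom repeatedly until the page height stops growing. This is a minimal sketch, not something tested against this website: `scroll_until_stable` is a hypothetical helper, `driver` is assumed to be a Selenium WebDriver (anything with an `execute_script` method works), and it assumes the window itself is the scrolling element. If the results sit in an inner scrollable container, the scripts would have to target that element instead.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom until document height stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded by the last scroll, so stop
            return rounds
        last_height = new_height
    return max_rounds
```

After it returns, `driver.page_source` can be handed to BeautifulSoup as usual. The `max_rounds` cap is there so that an endlessly growing page cannot loop forever.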