I am trying to scrape this website. You need to click the magnifying glass icon in the search bar to see the records I want to extract. The issue is that the website is dynamic: I need to scroll multiple times to get the whole page loaded, and only then can I extract the content with rvest or BeautifulSoup.
However, none of the scrolling methods from other threads have worked for me so far.
I would appreciate a solution in R or Python, using any package or library.
I tried
remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")
where remDr holds the page after clicking on the magnifying glass icon.
I also tried defining the search results: I inspected the page and extracted the XPath that points to the list of items:
search_results <- remDr$findElement( using = 'xpath', '//*[@id="search-feature-container"]/div[2]/div[2]/div[3]/div[2]/div[1]' )
Then I ran this line, but there was no scrolling at all :(
search_results$sendKeysToElement(list(key = "down"))
You're asking for R or Python solutions, so here is one in Python. That information is loaded into the page dynamically via XHR calls, which you can see in the Network tab of the browser's Dev Tools, so you can request it directly without scrolling.
Here is one way to get all that research data:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import json
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
big_df = pd.DataFrame()
s = requests.Session()
s.headers.update(headers)
# The index holds 7K+ records; request them in pages of 1000 via $top and $skip
for x in range(0, 8000, 1000):
    r = s.get(f'https://vivli-prod-cus-srch.search.windows.net/indexes/studies/docs?api-key=C8237BFE70B9CC48489DC7DD84D88379&api-version=2016-09-01&$top=1000&$skip={x}&search=*&$filter=assignedAppType%20eq%20%27Default%27&$count=true&facet=studyDesign&facet=locationsOfStudySites,count:300,sort:value&facet=sponsorType&facet=contributorType&facet=sponsorName,count:500,sort:value&facet=studyType&facet=actualEnrollment,interval:100')
    r.raise_for_status()  # stop early if a page request fails
    # the records of each page sit under the 'value' key of the JSON response
    df = pd.json_normalize(r.json()['value'])
    big_df = pd.concat([big_df, df])
print(big_df)
Result in terminal (limited to first two rows, there are 7K+ records in dataframe):
@search.score id title sponsorProtocolId orgId orgCode orgName irpOrgName sponsorName overrideDisplayDefaults nctId secondaryIds acronym participantTermCodes participantTerms interventionTermCodes interventionTerms outcomeTermCodes outcomeTerms searchParticipantTermCodes searchOutcomeTermCodes searchInterventionTermCodes actualEnrollment locationsOfStudySites studyType studyDesign principalInvestigator studyStartDate studyEndDate sponsorType contributorType studyDoi phase conditions interventionNames outcomeNames extractedConditions extractedInterventions antimicrobials groupingsOfResistancePatterns organisms specimenSources sampleTimes countries regions yearsDataCollected containsPediatrics containsGenotype assignedAppType numberOfIsolates program lastUpdatedDate
0 1.0 abd778c4-21ed-4063-9e34-e3e7b177db18 A Randomized, Double-Blind, Parallel-Group, Dose-Response Study to Evaluate the Efficacy and Safety of Two Doses of Topiramate Compared to Placebo and Propranolol in the Prophylaxis of Migraine CR003205 d1bd067d-3e2d-43b5-80f1-6235e85c2876 JNJ Johnson & Johnson Yoda Project Johnson & Johnson Pharmaceutical Research & Development, L.L.C. N NCT00236561 [] [lr5qxyw6ww35, kk05h7rpym8w, kk05h7rpym8x, kk05h7rpym8y, kk05h7rpym8z, kk05h7rpym90, kk05h7rpym91, r4hp3896n2zy] [Male and Female, Child 6-12 years, Adolescent 13-18 years, Young Adult 19-24 years, Adult 19-44 years, Middle Aged 45-64 years, Aged 65-79 years, Migraine] [kn3ptfq7c6lz, r4hp0qywwn28, 11g43clqdpk96, r4hp0r5sbtj7, q25gz0m8n54j, r4hp0r2dwmn5] [Pharmacological, Topiramate, Oral, Propranolol, No active treatment, Placebo] [q25g9q497cwj, r4hp3896n2zy, r4hp5zkjq0c3, ZxM7N2m9kOhRe2] [Physiological or clinical, Migraine, Evaluating Response To Treatment, Assessment Of Quality Of Life] [lr5qxyw6ww35, kk05h7rpym8w, pwhpjmwdbgkh, kk05h7rpym8x, kk05h7rpym8y, pwhpjmwdbgkg, kk05h7rpym8z, kk05h7rpym90, kk05h7rpym91, pwhpjmwdbgkf, r4hp3896n2zy, r4hp3p8ymhbg, r4hp38gs74r1, r4hp3885vk99, r4hp38mgkgb9, r4hp39w4k8tw, r4hp38c875ch, r4hp38mgkgj7, r4hp38xpp96f, r4hp3853gyf1, r4hp38l4pbqh, r4hp39krwnf7, r4hp38qpgvxq, r4hp387wrzbr, r4hp38mrn1cp, r4hp39tp4ckr, r4hp38819rxs, r4hp39mjd4qj, r4hp39cb1vjv] [q25g9q497cwj, r4hp3896n2zy, r4hp3p8ymhbg, r4hp38gs74r1, r4hp3885vk99, r4hp38mgkgb9, r4hp39w4k8tw, r4hp38c875ch, r4hp38mgkgj7, r4hp38xpp96f, r4hp3853gyf1, r4hp38l4pbqh, r4hp39krwnf7, r4hp38qpgvxq, r4hp387wrzbr, r4hp38mrn1cp, r4hp39tp4ckr, r4hp38819rxs, r4hp39mjd4qj, r4hp39cb1vjv, r4hp5zkjq0c3, r4hp5zjccp22, r4hp5zjccp1z, r4hp5zm4npzj, r4hp5zhs6j1c, zPNWxozYM3fxBr, r4hp5zjng89p, r4hp5yw4mj85, ZxM7N2m9kOhRe2, 3BgZRR0YwkHzkP] [kn3ptfq7c6lz, r4hp0qywwn28, r4hp13n1ty7w, r4hp13rf9486, 11g43clqdpk96, r4hp0r5sbtj7, zrcts8tmxp0g, r4hp13n1ty7r, r4hp13mrrc91, r4hp13mrrc8c, r4hp13mgns4j, r4hp13mrrc83, 
q25gz0m8n54j, r4hp0r2dwmn5, PXmmxKGR3ocNEg] 786 [] Interventional ParallelGroup 2001-04-01T00:00:00Z 2002-12-31T00:00:00Z Industry Unassigned https://doi.org/10.25934/00004657 Phase3 [Migraine] [Topiramate, Propranolol, Placebo] [Migraine, Evaluating Response To Treatment, Migraine, Assessment Of Quality Of Life] [Migraine, Common Migraine, Classic Migraine, Headache] [topiramate, propranolol] [] [] [] [] [] [] [] [] None None Default 0
1 1.0 48c15b9e-76d7-45cc-a044-6c253da74ac1 A Phase 3, Randomized, Open-label, Parallel-group, Multicenter Trial to Evaluate the Safety and Efficacy of Infliximab (REMICADE®) in Pediatric Subjects With Moderately to Severely Active Ulcerative Colitis CR012388 d1bd067d-3e2d-43b5-80f1-6235e85c2876 JNJ Johnson & Johnson Yoda Project Centocor, Inc. N NCT00336492 [C0168T72] [lr5qxyw6ww35, kk05h7rpym8v, kk05h7rpym8w, kk05h7rpym8x, r4hp3q5y2klm] [Male and Female, Child, Preschool 2-5 years, Child 6-12 years, Adolescent 13-18 years, Acute Ulcerative Colitis] [kn3ptfq7c6lz, r4hp13l4sngc, 11g43clqdpk72] [Pharmacological, Infliximab, Intravenous] [q25g9q497cwj, r4hp5zkjq0c3, r4hp5zfl2n7g] [Physiological or clinical, Evaluating Response To Treatment, Activity Analysis] [lr5qxyw6ww35, kk05h7rpym8v, pwhpjmwdbgkh, kk05h7rpym8w, kk05h7rpym8x, r4hp3q5y2klm, r4hp384nvkyl, r4hp39vkd3t3, r4hp39lc1tgs, r4hp39hf705k, r4hp39kgt2jy, r4hp38mgkgb9, r4hp39w4k8tw, r4hp38c875ch, r4hp38mgkgj7, r4hp38jd5vlp, r4hp38gxry6k, r4hp38bczf6g, r4hp38yky17z, r4hp38z9n01d, r4hp39qlrrnf, r4hp381fy5cs, r4hp381fy5cw, r4hp393pwqm9, r4hp39mjd4qj, r4hp3b0d86ss, r4hp39znk89c, r4hp39b989y6, r4hp38mb0nx2, r4hp39ys9j31, r4hp39ln4dll, r4hp39krwnf7, r4hp39l6j13q, r4hp38hhy39p, r4hp381fy5cl, r4hp38jtt70k, r4hp38f9t7jr, r4hp39zj0gsm, r4hp38nsfl6q, r4hp38n1qmfs, r4hp39ln4dhj, r4hp39j9gr79, r4hp38jp8fh0, r4hp38y8vgc2, r4hp39v3rr24, r4hp3b0twljt, r4hp38819rv0, r4hp3pdb2p7r, r4hp39hf702g, eM3W2jDdq4CnoM] [q25g9q497cwj, r4hp5zkjq0c3, r4hp5zjccp22, r4hp5zjccp1z, r4hp5zm4npzj, r4hp5zhs6j1c, zPNWxozYM3fxBr, r4hp5zjng89p, r4hp5yw4mj85, r4hp5zfl2n7g, r4hp5yxm1fj5, r4hp5yq9rf4h, r4hp5z5crc2v, r4hp5zbhq1cb, r4hp5z0tyv1k, r4hp5yvdxkr1, r4hp5zjccp2h, r4hp5zkzbcvf] [kn3ptfq7c6lz, r4hp13l4sngc, r4hp13nhg9tp, YgJdXZMgAyT4za, r4hp13n1ty7z, r4hp13qznrsn, 3r0XoawY07FG2Z, 11g43clqdpk72, PNz3A1OgQesRKw, 11g43clqdpk4z, r4hp5z5nty2h, r4hp5zj2934z, r4hp5zhs6j1c, zPNWxozYM3fxBr] 60 [United States, Canada, Belgium, Denmark, Netherlands] 
Interventional ParallelGroup 2006-09-01T00:00:00Z 2010-04-30T00:00:00Z Industry Unassigned https://doi.org/10.25934/00004723 Phase3 [Acute Ulcerative Colitis] [Infliximab] [Evaluating Response To Treatment, Activity Analysis] [Ulcerative Colitis] [infliximab] [] [] [] [] [] [] [] [] None None Default 0
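A possible refinement: the loop in the answer hard-codes 8000 as the upper bound, which will silently miss records if the index grows. Since the request already passes `$count=true`, the response should also report the total number of matches, so paging can stop on its own. Below is a minimal sketch of that idea, not a tested drop-in. It assumes the total arrives under the `@odata.count` key of the JSON response. The names `iter_records`, `fetch_page`, and `fake_fetch` are hypothetical, and `fake_fetch` is an in-memory stand-in so the sketch runs without network access.

```python
from typing import Any, Callable, Dict, Iterator

Page = Dict[str, Any]


def iter_records(fetch_page: Callable[[int, int], Page],
                 page_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield every record, requesting pages until the reported total is reached."""
    skip = 0
    total = None  # unknown until the first response arrives
    while total is None or skip < total:
        payload = fetch_page(skip, page_size)
        rows = payload.get("value", [])
        if not rows:
            break  # an empty page means there is nothing more to read
        yield from rows
        # Assumption: the service reports the total under '@odata.count'
        total = payload.get("@odata.count", total)
        skip += len(rows)


def fake_fetch(skip: int, top: int) -> Page:
    """Hypothetical stand-in for the real HTTP call: serves 2500 fake records."""
    data = [{"id": i} for i in range(2500)]
    return {"@odata.count": len(data), "value": data[skip:skip + top]}


records = list(iter_records(fake_fetch))
print(len(records))
```

With the session `s` from the answer, `fetch_page` could be a small function that calls `s.get(...)` on the same URL with `$top` and `$skip` filled in and returns `r.json()`. The collected list can then be passed to `pd.json_normalize` once, which also avoids calling `pd.concat` on every iteration.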