I am trying to scrape the URLs of Instagram posts tagged with the hashtag "foody". Using Selenium and BeautifulSoup, I was able to collect around 2,160 post URLs.
However, I could not get beyond that, even though the hashtag has more than 4,000,000 posts. Is there an alternative way to scrape all posts with the "foody" hashtag, or at least the URLs of posts published between 2018 and 2019?
Below is my scraping code.
Thanks!
import time

from bs4 import BeautifulSoup
from selenium import webdriver

instagram_url = "https://www.instagram.com"
tag_url = "https://www.instagram.com/explore/tags"
ads = "foody"  # hashtag
# seconds to wait after each page load or scroll
pause_time = 2
# collected post URLs, in the order they were found
reallink = []
# driver (Selenium 3 style; Selenium 4 expects webdriver.Chrome(service=Service("chromedriver.exe")))
driver = webdriver.Chrome("chromedriver.exe")
# go to hashtag page
driver.get(f"{tag_url}/{ads}")
time.sleep(pause_time)
# scroll to the bottom once and remember the page height
scroll_script = "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;"
lenOfPage = driver.execute_script(scroll_script)
match = False
i = 0
while not match:
    # collect the post links that are currently rendered
    html = driver.page_source
    bs_html = BeautifulSoup(html, "lxml")
    for roots in bs_html.find_all(name="div", attrs={"class": "Nnq7C weEfm"}):
        for link in roots.select("a"):
            real = link.attrs["href"]
            if real not in reallink:
                reallink.append(real)
    print("appended data: ", len(reallink))
    # scroll down again
    lastCount = lenOfPage
    print(f"scrolling down {i}")
    i += 1
    time.sleep(pause_time)
    lenOfPage = driver.execute_script(scroll_script)
    # stop when the page height no longer grows
    if lastCount == lenOfPage:
        match = True
Try the Social Scroll for Instagram extension (I know it's really basic, but it works). As Alvaro Bataller said, if you write a script that scrolls down, then after it has scrolled past a number of posts Instagram will automatically block you for a certain period of time because it suspects you are a bot.
This extension has a built-in cool-down system that pauses the scrolling so that Instagram won't mistake you for a bot. In my experience it can get you all the way to the last post without running into a temporary block.
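If you would rather stay with your Selenium script, the same cool-down idea can be roughly sketched in Python. This is only an illustration: I don't know how the extension actually implements its pauses, the function name `scroll_with_cooldown` and all the timing values are assumptions of mine, and there is no guarantee that any particular timing avoids a block.

```python
import random
import time


def scroll_with_cooldown(driver, pause=2.0, batch=20, cooldown=60.0,
                         max_scrolls=1000, sleep=time.sleep):
    """Scroll to the bottom repeatedly, resting longer every `batch` scrolls.

    Hypothetical helper: the default timings are guesses, not values taken
    from Instagram or from the extension. Returns the number of scrolls done.
    Stops when the page height stops growing or after `max_scrolls` scrolls.
    """
    script = ("window.scrollTo(0, document.body.scrollHeight);"
              "return document.body.scrollHeight;")
    last_height = driver.execute_script(script)
    scrolls = 0
    while scrolls < max_scrolls:
        # Short randomized pause so the scrolls are not perfectly periodic.
        sleep(pause + random.uniform(0, pause))
        scrolls += 1
        if scrolls % batch == 0:
            # Longer rest after every `batch` scrolls (the "cool-down").
            sleep(cooldown)
        height = driver.execute_script(script)
        if height == last_height:
            # The page did not grow, so assume no more posts were loaded.
            break
        last_height = height
    return scrolls
```

With the `driver` from the question this would be called as `scroll_with_cooldown(driver)`. To keep collecting links while scrolling, the BeautifulSoup parsing from the question would still have to run between scrolls, for example by calling the function with `max_scrolls=1` inside the existing loop.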