python selenium web-scraping instagram infinite-scroll

How to scroll down to the end on Instagram


I tried to scrape the URLs of Instagram posts tagged with the hashtag "foody". Using Selenium and BeautifulSoup, I could scrape around 2,160 post URLs.

However, I could not get beyond that, even though there are more than 4,000,000 posts. Is there another way to scrape all the posts with the "foody" hashtag, or at least the URLs of posts made between 2018 and 2019?

Below is my code for scraping.

Thanks!

    
    
    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver

    instagram_url = "https://www.instagram.com"
    tag_url = "https://www.instagram.com/explore/tags"
    ads = "foody"  # hashtag

    # seconds to wait after each scroll so new posts can load
    pause_time = 2

    # JavaScript that scrolls to the bottom and returns the page height
    scroll_js = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"

    # post URLs collected so far, without duplicates
    reallink = []

    # driver (Selenium 3 style call; newer Selenium 4 releases take the path through a Service object instead)
    driver = webdriver.Chrome("chromedriver.exe")

    # go to hashtag page
    driver.get(f"{tag_url}/{ads}")
    time.sleep(pause_time)

    # initial scroll
    lenOfPage = driver.execute_script(scroll_js)
    match = False
    i = 0
    while not match:
        # collect the post URLs currently present in the DOM
        html = driver.page_source
        bs_html = BeautifulSoup(html, "lxml")
        for roots in bs_html.find_all(name="div", attrs={"class": "Nnq7C weEfm"}):
            for link in roots.select("a"):
                real = link.attrs["href"]
                if real not in reallink:
                    reallink.append(real)
        print("appended data: ", len(reallink))

        # scroll down; stop once the page height no longer grows
        lastCount = lenOfPage
        print(f"scrolling down {i}")
        i += 1
        time.sleep(pause_time)
        lenOfPage = driver.execute_script(scroll_js)
        if lastCount == lenOfPage:
            match = True

Solution

  • Try the Social Scroll for Instagram extension (I know it's really basic, but it works). As Alvaro Bataller said, if you write a script to scroll down, then after several scrolls Instagram will automatically block you for a certain period of time, assuming you are a bot.

    This extension has a built-in cool-down system that pauses the scrolling so that Instagram won't mistake you for a bot. It can get you to the last post without being temporarily blocked by Instagram.
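If you would rather stay in Selenium, the same cool-down idea can be sketched roughly as below. This is only an illustration of the technique, not how the extension works internally: the function name `scroll_with_cooldown` and the parameters `batch`, `cooldown` and `max_stalls` are made up for this example, the timings are guesses, and nothing here guarantees that Instagram won't still block you. The `driver` argument only needs an `execute_script` method, so a Selenium WebDriver should fit.

```python
import random
import time

# Scrolls to the bottom of the page and returns the resulting page height.
SCROLL_JS = (
    "window.scrollTo(0, document.body.scrollHeight);"
    "return document.body.scrollHeight;"
)


def scroll_with_cooldown(driver, pause=2.0, batch=20, cooldown=60.0,
                         max_stalls=3, sleep=time.sleep):
    """Scroll down repeatedly, taking a long rest after every `batch` scrolls.

    Stops once the page height has not grown for `max_stalls` consecutive
    scrolls and returns the total number of scrolls performed.
    All parameter names and defaults are assumptions for this sketch.
    """
    last_height = driver.execute_script(SCROLL_JS)
    scrolls = 1
    stalls = 0
    while stalls < max_stalls:
        # Short, jittered pause so the scrolls are not evenly spaced.
        sleep(pause + random.uniform(0, pause))
        if scrolls % batch == 0:
            # Long rest, standing in for the extension's cool-down.
            sleep(cooldown)
        height = driver.execute_script(SCROLL_JS)
        scrolls += 1
        if height == last_height:
            # No growth: either the end of the feed or a slow load.
            stalls += 1
        else:
            stalls = 0
            last_height = height
    return scrolls
```

You would call it as `scroll_with_cooldown(driver)` after `driver.get(...)`. Requiring several stalled scrolls in a row before giving up (instead of stopping at the first unchanged height, as the code in the question does) gives slow loads a chance to finish. Note that Instagram may drop older posts from the DOM as you scroll, so you would still need to collect the URLs between scrolls rather than once at the end.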