
How to load data without clicking button?


I want to scrape all the startup names from https://e27.co/startups/. By default the page shows 20 startup names, and to load more you need to click the "Load More" button. Each click loads 10 more startup names.

I have written a Python script that clicks the "Load More" button until all (29000) startups are loaded. It takes a lot of time and RAM. How can I load this data without clicking?

I have heard of something called an AJAX request, but I don't understand how to implement it.

HTML code of the button:

<button class="button btn-load-more" data-start="0">Load More</button>

The data-start attribute increases by 10 with each click.

Event code of the button (JS):

        startupList.elem.find('.btn-load-more').off('.click').click(function(){
            startupList.elem.find('.btn-load-more').addClass('hide');
            Global.loading();
            startupList.loadMoreIsClicked = true;
            var start = $(this).attr('data-start')*1;
            start += startupList.count;
            $(this).attr('data-start', start);
            startupList.searchAndFilterResult(start, startupList.getFormData("#startup_search"), false);

My python code:

    def __init__(self):
        opp = Options()
        opp.add_argument('--blink-settings=imagesEnabled=false')
        opp.add_argument('--headless')
        self.driver = webdriver.Chrome('./chromedriver', chrome_options=opp)

    def parse(self, e27_url = "https://e27.co/startups/"):
        self.driver.get(e27_url)
        time.sleep(3)
        run_check, prev_value_list = True, [0, 0]
        button = self.driver.find_element_by_xpath("//button[@class='button btn-load-more']")

        while run_check:
            quantity_of_loaded_startups = len(self.driver.find_elements_by_xpath(
                        "//div[@class='startup-block startup-list-item']"))
            print('Loading, {} startups loaded'.format(quantity_of_loaded_startups))
            prev_value_list.append(quantity_of_loaded_startups)
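            timer = 0
            # Wait up to 60 seconds for the button to become visible again.
            while not button.is_displayed():
                time.sleep(0.1)
                timer += 0.1
                print(timer)
                # Compare with >=: accumulated 0.1 steps may never equal exactly 60.
                if timer >= 60:
                    run_check = False
                    break

            # Stop instead of clicking a button that never reappeared.
            if not run_check:
                break

            button.click()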

            if prev_value_list[-2] == prev_value_list[-1] and  prev_value_list[-3]  == prev_value_list[-1]:
                run_check = False


        company_names, e_urls,  = [], []
        for item in self.driver.find_elements_by_xpath("//div[@class='startup-block startup-list-item']"):
            name = item.find_element_by_css_selector('.company-name').text
            e27url = item.find_element_by_css_selector(".startuplink").get_attribute("href")

            yield {"Startup":name,"Url":e27url}

You can go to e27.co/startups and check it yourself.

Thanks, qwew


Solution

  • You can access their API directly. Press the Load More button while watching the Network tab of your browser's developer tools to see which URL the request is sent to. In this case, the request goes to the following URL:

    https://e27.co/api/startups/?tab_name=recentlyupdated&start=10&length=10
    

    Hence, by changing the start and length parameters, you can fetch more entries per request. I've written a simple script to get the names of the startups.

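    import requests
    
    start_number = 0  # offset of the first startup to return
    # length=100 asks for 100 entries per request instead of the 10 the button requests
    r = requests.get('https://e27.co/api/startups/?tab_name=recentlyupdated&start={}&length=100'.format(start_number))
    data = r.json()
    for i in data['data']['list']:
        print(i['name'])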
    
    #outputs
    RESYNC Technologies
    Swizzle
    Sports365
    ShopClues
    Symantec
    SpoonJoy
    SEOPRO India
    Solarium
    SHOPLINE
    Structo
    Coc Coc
    CarDekho
    Chillr
    Culture Machine
    CoAssets
    CoinMKT
    CimplyFive
    Call Levels
    CereBrahm Innovations
    CouponzGuru
    Aisle
    adMingle
    AppsFlyer
    AppVirality
    Ambient Digital
    Airtel
    Apptopia
    Latize
    Lefora
    LINC 360
    LogisticsIndonesia
    LogicGateOne Corporation
    Livspace
    LivePhuket
    LINE Ventures
    National Tiles-Sydney
    National Tiles-Brisbane
    National Research Foundation
    National Tiles
    National Tiles-Adelaide
    National University of Singapore School of Computing
    National Tiles-Wagga Wagga
    National Tiles-Springwood
    National Tiles-Burleigh Heads
    Nationkart
    Natasha
    Naturally Yours
    Native5
    Nativfy
    NaturalMantra
    Native Tongue
    NewsHunt
    Nimble Wireless
    Nanarokom.com
    NoBroker
    News Corp
    Naxos International
    NecesCity
    NextGen
    Notey
    Naspers Group
    NAM TRIP TRAVEL
    Navigat Group
    Nanosatisfi
    Naaptol
    Single Thailand
    sinhasoft
    Sinergy
    Singsys Pte. Ltd.
    Simplilearn
    SIFS India
    Simprosys InfoMedia
    SimiCommerce
    SingPost
    Singapore Press Holdings
    SimplerCloud
    SingSaver
    Sinoze
    Singapore infocomm Technology Federation
    Native Tech
    Novelship
    AthenaDesk
    ZERO BrandCard™
    Open24.vn
    iMyanmarHouse
    Shufti Pro
    MobME Wireless
    Moolya Testing
    Mofang Gongyu
    Moff Inc.
    Moonfrog Labs
    myNoticePeriod
    MaGIC
    Momoe
    Manthan
    Metaps
    Motorola Solutions
    MatchMove
    Mondano
    MOL- Money Online
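
The script above fetches only the first 100 entries. To collect every startup, the same request could be repeated with an increasing start value. The following is a rough sketch of that idea, not a tested client for the site: the function names `fetch_page` and `iter_startup_names` are made up for illustration, and it assumes the API keeps returning the `data` -> `list` structure shown above and returns an empty list once `start` passes the last entry.

```python
API_URL = "https://e27.co/api/startups/"  # endpoint taken from the answer above


def fetch_page(start, length):
    """Fetch one page of startups and return the list found under data -> list."""
    import requests  # imported here so the paging logic can run without the dependency

    params = {"tab_name": "recentlyupdated", "start": start, "length": length}
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["data"]["list"]


def iter_startup_names(fetch=fetch_page, length=100):
    """Yield startup names page by page until a page comes back empty.

    Assumption: an empty list means there are no more startups to load.
    """
    start = 0
    while True:
        page = fetch(start, length)
        if not page:
            break
        for item in page:
            yield item["name"]
        start += length
```

Calling `for name in iter_startup_names(): print(name)` would then walk through all pages. Passing the fetch function as a parameter keeps the paging loop separate from the HTTP call, so the loop can be checked with canned data before hitting the real site. If the API caps `length` below 100 or signals the end differently, the stop condition would need adjusting.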