I want to scrape all the startup names from https://e27.co/startups/. By default the page shows 20 startup names, and to load more you have to click the "Load More" button, which loads 10 more names per click.
I have written a Python script that keeps clicking the "Load More" button until all (29000) startups are loaded. It takes a lot of time and RAM. How can I load this data without all the clicking?
I have heard this can be done with an AJAX request, but I don't understand how to implement that.
HTML code of the button:
<button class="button btn-load-more" data-start="0">Load More</button>
The data-start attribute increases by 10 with each click.
Event code of the button (JS):
startupList.elem.find('.btn-load-more').off('.click').click(function(){
    startupList.elem.find('.btn-load-more').addClass('hide');
    Global.loading();
    startupList.loadMoreIsClicked = true;
    var start = $(this).attr('data-start')*1;
    start += startupList.count;
    $(this).attr('data-start', start);
    startupList.searchAndFilterResult(start, startupList.getFormData("#startup_search"), false);
});
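Read as pagination, the handler above only advances an offset: each click adds `startupList.count` to `data-start` and requests the next slice starting there. Below is a minimal Python sketch of the `start` values this produces, assuming `count` is 10 (which matches the +10 change seen in `data-start`). The helper name `load_more_offsets` is hypothetical and not part of the site's code.

```python
def load_more_offsets(clicks, count=10, data_start=0):
    """Return the `start` value sent on each of `clicks` successive clicks.

    Assumption: `count` is 10 and `data_start` begins at 0, as in the
    button markup shown above.
    """
    offsets = []
    start = data_start
    for _ in range(clicks):
        start += count         # mirrors: start += startupList.count
        offsets.append(start)  # mirrors: $(this).attr('data-start', start)
    return offsets


print(load_more_offsets(3))  # [10, 20, 30]
```

If that reading is right, the browser is not needed at all: the same offsets could be sent to the server directly.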
My Python code:
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class E27Scraper:  # class name assumed; the class line is missing from the post
    def __init__(self):
        opp = Options()
        opp.add_argument('--blink-settings=imagesEnabled=false')
        opp.add_argument('--headless')
        self.driver = webdriver.Chrome('./chromedriver', chrome_options=opp)

    def parse(self, e27_url="https://e27.co/startups/"):
        self.driver.get(e27_url)
        time.sleep(3)
        run_check, prev_value_list = True, [0, 0]
        button = self.driver.find_element_by_xpath("//button[@class='button btn-load-more']")
        while run_check:
            quantity_of_loaded_startups = len(self.driver.find_elements_by_xpath(
                "//div[@class='startup-block startup-list-item']"))
            print('Loading, {} startups loaded'.format(quantity_of_loaded_startups))
            prev_value_list.append(quantity_of_loaded_startups)
            timer = 0
            # Wait up to 60 seconds for the button to become visible again.
            while not button.is_displayed():
                time.sleep(0.1)
                timer += 0.1
                print(timer)
                # Repeated float additions may never equal exactly 60, so use >=.
                if timer >= 60:
                    run_check = False
                    break
            if not run_check:
                break  # timed out: the button is still hidden, so do not click it
            button.click()
            # Stop once the count has been unchanged for three iterations in a row.
            if prev_value_list[-2] == prev_value_list[-1] and prev_value_list[-3] == prev_value_list[-1]:
                run_check = False
        company_names, e_urls, = [], []
        for item in self.driver.find_elements_by_xpath("//div[@class='startup-block startup-list-item']"):
            name = item.find_element_by_css_selector('.company-name').text
            e27url = item.find_element_by_css_selector(".startuplink").get_attribute("href")
            yield {"Startup": name, "Url": e27url}
You can go to e27.co/startups and check it yourself.
Thanks, qwew
You can access their API directly by finding the URL that the request is sent to when you press the Load More button (for example, in the Network tab of your browser's developer tools). In this case, the request goes to the following URL:
https://e27.co/api/startups/?tab_name=recentlyupdated&start=10&length=10
By modifying the length and start parameters, you can get more results per request. I've written a simple script that gets the names of the startups.
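import requests

start_number = 0
# Request 100 startups in one call, beginning at offset start_number.
r = requests.get('https://e27.co/api/startups/?tab_name=recentlyupdated&start={}&length=100'.format(start_number))
data = r.json()
# The startup records sit under data -> list in the JSON response.
for i in data['data']['list']:
    print(i['name'])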
#outputs
RESYNC Technologies
Swizzle
Sports365
ShopClues
Symantec
SpoonJoy
SEOPRO India
Solarium
SHOPLINE
Structo
Coc Coc
CarDekho
Chillr
Culture Machine
CoAssets
CoinMKT
CimplyFive
Call Levels
CereBrahm Innovations
CouponzGuru
Aisle
adMingle
AppsFlyer
AppVirality
Ambient Digital
Airtel
Apptopia
Latize
Lefora
LINC 360
LogisticsIndonesia
LogicGateOne Corporation
Livspace
LivePhuket
LINE Ventures
National Tiles-Sydney
National Tiles-Brisbane
National Research Foundation
National Tiles
National Tiles-Adelaide
National University of Singapore School of Computing
National Tiles-Wagga Wagga
National Tiles-Springwood
National Tiles-Burleigh Heads
Nationkart
Natasha
Naturally Yours
Native5
Nativfy
NaturalMantra
Native Tongue
NewsHunt
Nimble Wireless
Nanarokom.com
NoBroker
News Corp
Naxos International
NecesCity
NextGen
Notey
Naspers Group
NAM TRIP TRAVEL
Navigat Group
Nanosatisfi
Naaptol
Single Thailand
sinhasoft
Sinergy
Singsys Pte. Ltd.
Simplilearn
SIFS India
Simprosys InfoMedia
SimiCommerce
SingPost
Singapore Press Holdings
SimplerCloud
SingSaver
Sinoze
Singapore infocomm Technology Federation
Native Tech
Novelship
AthenaDesk
ZERO BrandCard™
Open24.vn
iMyanmarHouse
Shufti Pro
MobME Wireless
Moolya Testing
Mofang Gongyu
Moff Inc.
Moonfrog Labs
myNoticePeriod
MaGIC
Momoe
Manthan
Metaps
Motorola Solutions
MatchMove
Mondano
MOL- Money Online
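To collect every startup rather than only the first 100, the same request can be repeated with an increasing start value. The following is a rough sketch, not a tested implementation. It assumes the endpoint returns an empty list once start passes the last record, and that the response keeps the data -> list -> name layout used above. The names `fetch_page` and `iter_startup_names` are made up for this example. It uses only the standard library; `requests.get(url, params=...)` would work equally well.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://e27.co/api/startups/"


def fetch_page(start, length=100):
    """Fetch one page of startup records beginning at offset `start`."""
    query = urlencode({"tab_name": "recentlyupdated", "start": start, "length": length})
    with urlopen("{}?{}".format(API_URL, query), timeout=30) as response:
        payload = json.load(response)
    return payload["data"]["list"]


def iter_startup_names(fetch=fetch_page, length=100):
    """Yield startup names page by page until a page comes back empty.

    Assumption: an empty list marks the end of the data set.
    """
    start = 0
    while True:
        page = fetch(start, length)
        if not page:
            break
        for item in page:
            yield item["name"]
        # Advance by what was actually returned, in case the server caps `length`.
        start += len(page)
```

Calling `for name in iter_startup_names(): print(name)` would then print the names one page at a time. The `fetch` argument exists so the paging logic can be tried with a stand-in function before hitting the real site, and a short `time.sleep` between pages is a reasonable courtesy to the server.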