I am new to Python/BeautifulSoup and I am trying to extract some data.
My website structure looks like this, and if I open the div with the border class it looks like this (image below).
I have tried the following:
```python
for p in soup.find_all('p', attrs={'class': 'bid_no pull-left'}):
    print(p.find('a').contents[0])
```
A single div's structure looks like the snippet below. There are around 10 such divs on each page, and from each one I want to extract the Items, Quantity Required, Bid number, and End date.
Please help me.
```html
<div class="border block " style="display: block;">
<div class="block_header">
<p class="bid_no pull-left"> BID NO: <a style="color:#fff !important" href="/showbidDocument/1844736">GEM/2020/B/763154</a></p>
<p class="pull-right view_corrigendum" data-bid="1844736" style="display:none; margin-left: 10px;"><a href="#">View Corrigendum</a></p>
<div class="clearfix"></div>
</div>
<div class="col-block">
<p><strong style="text-transform: none !important;">Item(s): </strong><span>Compatible Cartridge</span></p>
<p><strong>Quantity Required: </strong><span>8</span></p>
<div class="clearfix"></div>
</div>
<div class="col-block">
<p><strong>Department Name And Address:</strong></p>
<p class="add-height">
Ministry Of Railways<br> Na<br> South Central Railway N/a
</p>
<div class="clearfix"></div>
</div>
<div class="col-block">
<p><strong>Start Date: </strong><span>25-08-2020 02:54 PM</span></p>
<p><strong>End Date: </strong><span>04-09-2020 03:00 PM</span></p>
<div class="clearfix"></div>
</div>
<div class="clearfix"></div>
</div>
```
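Since each field in the snippet above sits in a `<strong>label</strong><span>value</span>` pair, one option is to extract by label rather than by position. A minimal sketch against a trimmed copy of the posted HTML (the `fields` dict and `parse` variable names are my own, not from the question):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the div structure posted above
html = """
<div class="border block " style="display: block;">
  <div class="block_header">
    <p class="bid_no pull-left"> BID NO: <a href="/showbidDocument/1844736">GEM/2020/B/763154</a></p>
  </div>
  <div class="col-block">
    <p><strong>Item(s): </strong><span>Compatible Cartridge</span></p>
    <p><strong>Quantity Required: </strong><span>8</span></p>
  </div>
  <div class="col-block">
    <p><strong>Start Date: </strong><span>25-08-2020 02:54 PM</span></p>
    <p><strong>End Date: </strong><span>04-09-2020 03:00 PM</span></p>
  </div>
</div>
"""

parse = BeautifulSoup(html, "html.parser")
block = parse.find("div", class_="border")

# Bid number lives inside the <a> of the p.bid_no element
bid_no = block.find("p", class_="bid_no").a.get_text(strip=True)

# Build a label -> value map from every <strong>/<span> sibling pair
fields = {}
for strong in block.find_all("strong"):
    span = strong.find_next_sibling("span")
    if span:
        label = strong.get_text(strip=True).rstrip(":")
        fields[label] = span.get_text(strip=True)

print(bid_no)                       # GEM/2020/B/763154
print(fields["Item(s)"])            # Compatible Cartridge
print(fields["Quantity Required"])  # 8
print(fields["End Date"])           # 04-09-2020 03:00 PM
```

This avoids hard-coding line offsets, so it keeps working even if the site reorders the fields inside a block.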
Try the approach below using requests and BeautifulSoup. The script builds a dynamic URL from the website's listing URL and traverses every page to collect the data.
What exactly the script does:
First, it creates a URL whose page_no query-string parameter is incremented by 1 after each page is processed.
requests fetches that URL with get, and the response is passed to BeautifulSoup, which parses the HTML using lxml.
From the parsed data, the script finds the div where the data actually lives.
Finally, it loops over that div's text content block by block on each page.
```python
import requests
from urllib3.exceptions import InsecureRequestWarning
from bs4 import BeautifulSoup as bs

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def scrap_bid_data():
    page_no = 1  # initial page number
    while True:
        print('Hold on, creating URL to fetch data...')
        URL = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no)  # create dynamic URL
        print('URL created: ' + URL)
        scraped_data = requests.get(URL, verify=False)  # request the page
        soup_data = bs(scraped_data.text, 'lxml')  # parse the scraped HTML using lxml
        extracted_data = soup_data.find('div', {'id': 'pagi_content'})  # div which contains the required data
        if extracted_data is None or len(extracted_data) == 0:  # no more data: quit and stop further execution
            break
        else:
            for idx in range(len(extracted_data)):  # loop through the div's children and print the data
                if idx % 2 == 1:  # the required data sits at odd indexes only
                    bid_data = extracted_data.contents[idx].text.strip().split('\n')
                    print('-' * 100)
                    print(bid_data[0])   # BID number
                    print(bid_data[5])   # Items
                    print(bid_data[6])   # Quantity Required
                    print(bid_data[10] + bid_data[12].strip())  # Department name and address
                    print(bid_data[16])  # Start date
                    print(bid_data[17])  # End date
                    print('-' * 100)
        page_no += 1  # increment the page number by 1

scrap_bid_data()
```
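The Start/End dates come out of the scrape as plain strings. If you want to sort bids or filter by deadline, you can convert them with `datetime.strptime`; a small sketch, assuming the `DD-MM-YYYY hh:mm AM/PM` format shown in the sample HTML (the `parse_bid_date` helper name is my own):

```python
from datetime import datetime

def parse_bid_date(text):
    # Format assumed from the sample, e.g. "04-09-2020 03:00 PM"
    return datetime.strptime(text.strip(), "%d-%m-%Y %I:%M %p")

end = parse_bid_date("04-09-2020 03:00 PM")
print(end)  # 2020-09-04 15:00:00
```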