
How to export all the details in a div to Excel/CSV using Beautiful Soup (Python)?


I am a newbie to Beautiful Soup/Python and I am trying to extract some data. My website is structured as a list of divs with the `border block` class, one of which is shown in the snippet below.

I have done something like this:

```python
for p in soup.find_all('p', attrs={'class': 'bid_no pull-left'}):
    print(p.find('a').contents[0])
```


Each of these divs looks like the HTML below, and there are around 10 of them on each page. From each div I want to extract the Item(s), Quantity Required, Bid number, and End date. Please help me.

<div class="border block " style="display: block;">
    <div class="block_header">
        <p class="bid_no pull-left"> BID NO: <a style="color:#fff !important" href="/showbidDocument/1844736">GEM/2020/B/763154</a></p> 
        <p class="pull-right view_corrigendum" data-bid="1844736" style="display:none; margin-left: 10px;"><a href="#">View Corrigendum</a></p>

         <div class="clearfix"></div>
    </div>

    <div class="col-block">
        <p><strong style="text-transform: none !important;">Item(s): </strong><span>Compatible Cartridge</span></p>
        <p><strong>Quantity Required: </strong><span>8</span></p>

        <div class="clearfix"></div>
    </div>
    <div class="col-block">
        <p><strong>Department Name And Address:</strong></p>
        <p class="add-height">
            Ministry Of Railways<br> Na<br> South Central Railway N/a
        </p>
        <div class="clearfix"></div>
    </div>
    <div class="col-block">
        <p><strong>Start Date: </strong><span>25-08-2020 02:54 PM</span></p>
        <p><strong>End Date: </strong><span>04-09-2020 03:00 PM</span></p>
        <div class="clearfix"></div>

    </div>


    <div class="clearfix"></div>
</div>
```

When I run my code, I get an error.


Solution

  • Try the approach below using requests and Beautiful Soup. I created the script with the base URL fetched from the website, and it then builds a dynamic URL to traverse each and every page and get the data.

    What exactly the script is doing:

    1. First, the script creates a URL whose page_no query-string parameter is incremented by 1 upon completion of each traversal.

    2. requests fetches the data from the created URL using the get method, and the response is passed to Beautiful Soup to parse the HTML structure using lxml.

    3. From the parsed data, the script searches for the div where the data is actually present.

    4. Finally, it loops through the text of all the divs, one by one, for each page.

      ```python
      import requests
      from urllib3.exceptions import InsecureRequestWarning
      requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
      from bs4 import BeautifulSoup as bs

      def scrap_bid_data():
          page_no = 1  # initial page number
          while True:
              print('Hold on, creating URL to fetch data...')
              URL = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no)  # create dynamic URL
              print('URL created: ' + URL)

              scraped_data = requests.get(URL, verify=False)  # request to get the data
              soup_data = bs(scraped_data.text, 'lxml')  # parse the scraped data using lxml
              extracted_data = soup_data.find('div', {'id': 'pagi_content'})  # div which contains the required data

              if extracted_data is None or len(extracted_data) == 0:  # no data left, so stop further execution
                  break
              else:
                  for idx in range(len(extracted_data)):  # loop through the child divs and print their data
                      if idx % 2 == 1:  # the required data sits at odd indexes only
                          bid_data = extracted_data.contents[idx].text.strip().split('\n')
                          print('-' * 100)
                          print(bid_data[0])   # BID number
                          print(bid_data[5])   # Items
                          print(bid_data[6])   # Quantity Required
                          print(bid_data[10] + bid_data[12].strip())  # Department name and address
                          print(bid_data[16])  # Start date
                          print(bid_data[17])  # End date
                          print('-' * 100)

                  page_no += 1  # increment the page number by 1

      scrap_bid_data()
      ```
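
    To export the scraped fields to a CSV file that Excel can open (which is what the question title asks for), the print() calls can be replaced by collecting one row per bid and writing the rows out with Python's built-in csv module. Below is a minimal, self-contained sketch; the bids.csv filename and the example row (copied from the HTML posted in the question) are illustrative assumptions.

    ```python
    import csv

    # Illustrative row copied from the HTML in the question; in the real script,
    # append one such list per bid inside the scraping loop instead of printing.
    rows = [
        ['GEM/2020/B/763154', 'Compatible Cartridge', '8',
         'Ministry Of Railways South Central Railway N/a',
         '25-08-2020 02:54 PM', '04-09-2020 03:00 PM'],
    ]

    with open('bids.csv', 'w', newline='', encoding='utf-8') as f:  # assumed output filename
        writer = csv.writer(f)
        writer.writerow(['Bid number', 'Items', 'Quantity Required',
                         'Department', 'Start date', 'End date'])  # header row
        writer.writerows(rows)  # one CSV line per bid; Excel opens this file directly
    ```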
      

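    One caveat: the answer extracts fields by splitting each block's text on newlines and indexing into the result (bid_data[0], bid_data[5], ...), which breaks as soon as the site changes its markup. A more robust alternative, sketched below under the assumption that each bid keeps the class names from the question's HTML snippet (bid_no, col-block), targets those classes directly; parse_bid_block and the page.html input file are hypothetical names used for illustration.

    ```python
    from bs4 import BeautifulSoup

    def parse_bid_block(block):
        """Pull the labelled fields out of one 'border block' div via its class names."""
        bid_no = block.select_one('p.bid_no a').get_text(strip=True)
        fields = {}
        for p in block.select('div.col-block p'):
            label, value = p.find('strong'), p.find('span')
            if label and value:  # rows like 'Item(s):', 'Quantity Required:', 'End Date:'
                fields[label.get_text(strip=True).rstrip(':')] = value.get_text(strip=True)
        return bid_no, fields

    # page.html is an assumed local copy of one listing page.
    soup = BeautifulSoup(open('page.html', encoding='utf-8').read(), 'lxml')
    for block in soup.select('div.border.block'):
        bid_no, fields = parse_bid_block(block)
        print(bid_no, fields.get('Item(s)'),
              fields.get('Quantity Required'), fields.get('End Date'))
    ```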