Tags: python, web-scraping, beautifulsoup, python-requests, export-to-csv

How to scrape content from a website with no class or id specified in the attributes with BeautifulSoup4


I want to scrape separate pieces of content, like the text in the 'a' tag (i.e. only the name, "42mm Architecture"), and use 'Scope of services, Types of Built Projects, Locations of Built Projects, Style of work, Website' as CSV file headers with their content, for the whole webpage.

The elements have no class or ID associated with them, so I am stuck on how to extract those details properly; there are also 'br' and 'b' tags in between.

There are multiple 'p' tags before and after the block shown below. Here is the relevant markup from the website:

<h2>
  <a href="http://www.dezeen.com/tag/design-by-42mm-architecture" rel="noopener noreferrer" target="_blank">
   42mm Architecture
  </a>
  |
  <span style="color: #808080;">
   Delhi | Top Architecture Firms/ Architects in India
  </span>
 </h2>
 <!-- /wp:paragraph -->
 <p>
  <b>
   Scope of services:
  </b>
  Architecture, Interiors, Urban Design.
  <br/>
  <b>
   Types of Built Projects:
  </b>
  Residential, commercial, hospitality, offices, retail, healthcare, housing, Institutional
  <br/>
  <b>
   Locations of Built Projects:
  </b>
  New Delhi and nearby states
  <b>
   <br/>
  </b>
  <b>
   Style of work
  </b>
  <span style="font-weight: 400;">
   : Contemporary
  </span>
  <br/>
  <b>
   Website
  </b>
  <span style="font-weight: 400;">
   :
   <a href="https://www.42mm.co.in/">
    42mm.co.in
   </a>
  </span>
 </p>

So how is it done using BeautifulSoup4?


Solution

  • This one was a bit of a time-consuming one! The webpage is not well structured and has few tags and identifiers. On top of that, they haven't even kept the content consistent, e.g. one place has the heading Scope of Services and another has Scope of services, and there are many more like that! So what I have done is a crude extraction, and it should also help you if you want to paginate.

    import requests
    from bs4 import BeautifulSoup
    import csv
    
    page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
    soup = BeautifulSoup(page.text, 'lxml')
    
    # there are many h2 tags but we want the ones without any class name
    h2 = soup.find_all('h2', class_='')
    
    headers = []
    contents = []
    header_len = []
    a_tags = []
    
    for i in h2:
        if i.find_next().name == 'a':             # to make sure we do not grab the wrong tag
            a_tags.append(i.find_next().text)
            p = i.find_next_sibling()
            contents.append(p.text)
            h = [j.text for j in p.find_all('strong')]   # some headings are wrapped in <strong> (bold) on the page
            headers.append(h)
            header_len.append(len(h))
    
    # since only some headings were in bold, the entry with the most bold tags gives the full set of headers
    headers = headers[header_len.index(max(header_len))]
    
    # removing the : from headings
    headers = [i[:-1] for i in headers]
    
    # inserted a new heading
    headers.insert(0, 'Firm')
    
    # n for traversing through headers list
    # k for traversing through a_tags list
    n = 1
    k = 0
    
    # this is the tricky part: each entry's text comes out as one long string that still contains the headings, like this
    """
    Scope of services: Architecture, Interiors, Urban Design.Types of Built Projects: Residential, commercial, hospitality, offices, retail, healthcare, housing, InstitutionalLocations of Built Projects: New Delhi and nearby statesStyle of work: ContemporaryWebsite: 42mm.co.in
    """
    # so I split it on ':' and then slice each piece off at the start of the next heading
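    # for example, after the split the entry above becomes roughly:
    # ['Scope of services', ' Architecture, Interiors, Urban Design.Types of Built Projects',
    #  ' Residential, ... housing, InstitutionalLocations of Built Projects', ..., ' 42mm.co.in']
    # i.e. each value (except the last) still ends with the start of the next heading,
    # which is what gets sliced off in the loop below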
    
    contents = [i.split(':') for i in contents]
    for i in contents:
        for j in i:
            h = headers[n][:5]                  # first 5 characters of the next expected heading
            if i.index(j) == 0:
                i[i.index(j)] = a_tags[k]       # the first piece is the leading heading; swap it for the firm name
                n += 1
                k += 1
            elif h in j:
                # chop off the next heading's text that got glued onto the end of this value
                i[i.index(j)] = j[:j.index(h)]
                j = j[:j.index(h)]
                if n < len(headers)-1:
                    n += 1
        n = 1
    
        # merging those extra values in the list if any
        if len(i) == 7:
            i[3] = i[3] + ' ' + i[4]
            i.remove(i[4])
    
    # writing into csv file
    # if you don't want a blank line between each row then add newline='' to the open() call below
    with open('output.csv', 'w') as f:   
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(contents)
    

    This was the output:

    [screenshot of the output CSV]
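
    Just as a side note: if every entry really looks like the snippet in your question (headings wrapped in <b>/<strong> tags inside a single <p>), the label/value pairs can also be pulled out by walking each bold tag's siblings instead of splitting the joined text on ':'. This is only a rough sketch of that idea, assuming the markup matches your snippet, not the approach used above:

    import requests
    from bs4 import BeautifulSoup

    page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
    soup = BeautifulSoup(page.text, 'lxml')

    rows = []
    for h2 in soup.find_all('h2', class_=''):
        first = h2.find_next()
        if first is None or first.name != 'a':      # same check as above, to skip unrelated h2 tags
            continue
        row = {'Firm': first.get_text(strip=True)}
        p = h2.find_next_sibling('p')
        if p is None:
            continue
        for label_tag in p.find_all(['b', 'strong']):
            label = label_tag.get_text(strip=True).rstrip(':').strip()
            if not label:
                continue                            # e.g. a <b> that only wraps a <br/>
            # gather the text that follows this heading up to the next bold heading
            value_parts = []
            for sib in label_tag.next_siblings:
                if getattr(sib, 'name', None) in ('b', 'strong'):
                    break
                value_parts.append(sib.get_text() if getattr(sib, 'name', None) else str(sib))
            row[label] = ' '.join(''.join(value_parts).split()).lstrip(': ')
        rows.append(row)

    # rows is now a list of dicts keyed by the headings, which csv.DictWriter can write out directly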

    If you want to paginate then just add the page number to the end of the url and you'll be good!

    page_num = 1
    while page_num < 13:
        page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')

        # paste the above code starting from soup = BeautifulSoup(page.text, 'lxml')

        page_num += 1
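
    To make that concrete, here is one minimal way the loop could collect every page's rows and write the CSV once at the end. parse_page is a hypothetical helper standing in for the parsing code above (it is not part of the original answer); the page range matches the loop above:

    import csv
    import requests
    from bs4 import BeautifulSoup

    def parse_page(soup):
        # hypothetical helper: paste the parsing code from above here and
        # return (headers, contents) for this one page's soup
        return [], []

    headers = []
    all_rows = []
    for page_num in range(1, 13):
        page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')
        soup = BeautifulSoup(page.text, 'lxml')
        headers, rows = parse_page(soup)
        all_rows.extend(rows)

    with open('output.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(all_rows)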
    

    Hope this helps, let me know if there's any error.

    EDIT 1: Sorry, I forgot to mention the most important part. If a tag has no class name, you can still select it with what I used in the code above:

    h2 = soup.find_all('h2', class_='')
    

    This just says: give me all the h2 tags that do not have a class name. That on its own can sometimes act as a unique identifier, since we are using the absence of a class value to pick out the tag.
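
    To see what that selector does, here is a tiny self-contained example (the HTML is made up just for the demo, it is not from the actual site):

    from bs4 import BeautifulSoup

    html = """
    <h2 class="widget-title">Related Posts</h2>
    <h2><a href="#">42mm Architecture</a> | <span>Delhi</span></h2>
    """

    soup = BeautifulSoup(html, 'lxml')

    # only the <h2> without a class attribute is matched
    for h2 in soup.find_all('h2', class_=''):
        print(h2.a.text)        # -> 42mm Architecture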