
How to parse information from elements with the same class using BeautifulSoup?


Suppose I have the following HTML:

html_doc = """

    <html>
    <head>
    <title>Page Title</title>
    </head>
    <body>
    
    <div class = "Box1">
      <span class = "catagory">Plant</span>
        <div class = "Box2">
          <span class = "sub-catagory">Trees</span>
            <div class = "characters">
              <div class = "font-medium">1.2</div>
              <div class = "font-medium">1.6</div>
              <div class = "font-medium">1.7</div>
              <div class = "font-medium">1.8</div>
              <div class = "font-medium">1.9</div>
              <div class = "font-medium">1.4</div>
            </div>
          <span class = "sub-catagory">Flowers</span>
            <div class = "characters">
              <div class = "font-medium">2.2</div>
              <div class = "font-medium">3.6</div>
              <div class = "font-medium">4.7</div>
              <div class = "font-medium">5.8</div>
              <div class = "font-medium">6.9</div>
              <div class = "font-medium">7.4</div>
            </div>
          </div>
      <span class = "catagory">animals</span>
        <div class = "Box2">
          <span class = "sub-catagory">human</span>
            <div class = "characters">
              <div class = "font-medium">7.2</div>
              <div class = "font-medium">9.6</div>
              <div class = "font-medium">4.7</div>
              <div class = "font-medium">3.8</div>
              <div class = "font-medium">6.9</div>
              <div class = "font-medium">9.4</div>
            </div>
          <span class = "sub-catagory">dog</span>
            <div class = "characters">
              <div class = "font-medium">4.2</div>
              <div class = "font-medium">5.6</div>
              <div class = "font-medium">6.7</div>
              <div class = "font-medium">1.8</div>
              <div class = "font-medium">3.9</div>
              <div class = "font-medium">8.4</div>
            </div>
          </div>
        <span class = "catagory">non-living</span>
        <div class = "Box2">
          <span class = "sub-catagory">rock</span>
            <div class = "characters">
              <div class = "font-medium">1.2</div>
              <div class = "font-medium">1.6</div>
              <div class = "font-medium">4.7</div>
              <div class = "font-medium">6.8</div>
              <div class = "font-medium">1.9</div>
              <div class = "font-medium">0.4</div>
            </div>
          <span class = "sub-catagory">stars</span>
            <div class = "characters">
              <div class = "font-medium">3.2</div>
              <div class = "font-medium">5.6</div>
              <div class = "font-medium">2.7</div>
              <div class = "font-medium">4.8</div>
              <div class = "font-medium">1.9</div>
              <div class = "font-medium">2.4</div>
            </div>
          </div>
    </div>
    </body>
    </html>

"""

Using the BeautifulSoup package for Python, I am able to get the categories, subcategories, and characters separately, as shown below:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, 'html.parser')

    # the class names keep the spelling used in the HTML ('catagory')
    categories = soup.find_all('span', class_='catagory')
    for category in categories:
        print(category.get_text())  # gives Plant, animals, non-living
    sub_categories = soup.find_all('span', class_='sub-catagory')
    for sub_category in sub_categories:
        print(sub_category.get_text())  # gives the subcategories
    measurements = soup.find_all('div', class_='font-medium')
    for measurement in measurements:
        print(measurement.get_text())  # gives all the font-medium values together

I am not sure how to get the following result, since the div classes are all the same. Please help.

Plant Trees 1.2 1.6 1.7 1.8 1.9 1.4 Flowers 2.2 3.6 4.7 5.8 6.9 7.4 animals human 7.2 9.6 4.7 3.8 6.9 9.4 dog 4.2 5.6 6.7 1.8 3.9 8.4 non-living rock 1.2 1.6 4.7 6.8 1.9 0.4 stars 3.2 5.6 2.7 4.8 1.9 2.4


Solution

To print the text in the expected order, select Box1 and call get_text() with '\n' as its separator argument and strip=True:

    print(soup.select_one('.Box1').get_text('\n',strip=True))
    
    Plant
    Trees
    1.2
    1.6
    1.7
    1.8
    1.9
    1.4
    Flowers
    2.2
    3.6
    4.7
    5.8
    6.9
    7.4
    animals
    ...
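
If you want everything on a single line, as in the question's expected result, the same call should work with a space as the separator. A minimal sketch, assuming a shortened stand-in for html_doc (sample_html and one_line are names made up for this illustration):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the question's html_doc
sample_html = """
<div class="Box1">
  <span class="catagory">Plant</span>
  <div class="Box2">
    <span class="sub-catagory">Trees</span>
    <div class="characters">
      <div class="font-medium">1.2</div>
      <div class="font-medium">1.6</div>
    </div>
    <span class="sub-catagory">Flowers</span>
    <div class="characters">
      <div class="font-medium">2.2</div>
      <div class="font-medium">3.6</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# A single space as separator joins all text nodes on one line;
# strip=True drops the whitespace-only nodes between the tags.
one_line = soup.select_one('.Box1').get_text(' ', strip=True)
print(one_line)
```

With the full html_doc from the question, the same call is expected to give the long single-line result shown there.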
    

For a more structured output, iterate over the sub-category spans and look up each one's category and values:

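    data = []
    # one record per sub-category span
    for e in soup.select('span.sub-catagory'):
        data.append({
            # nearest preceding category span
            'cat': e.find_previous('span', {'class': 'catagory'}).text,
            'subcat': e.text,
            # text of each value in the following "characters" div
            'characters': list(e.find_next('div').stripped_strings)
        })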
    
Example:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, 'html.parser')

    data = []

    for e in soup.select('span.sub-catagory'):
        data.append({
            'cat': e.find_previous('span', {'class': 'catagory'}).text,
            'subcat': e.text,
            'characters': list(e.find_next('div').stripped_strings)
        })
    data  # displays the list in a notebook/REPL; use print(data) in a script
    
Output:

    [{'cat': 'Plant',
      'subcat': 'Trees',
      'characters': ['1.2', '1.6', '1.7', '1.8', '1.9', '1.4']},
     {'cat': 'Plant',
      'subcat': 'Flowers',
      'characters': ['2.2', '3.6', '4.7', '5.8', '6.9', '7.4']},
     {'cat': 'animals',
      'subcat': 'human',
      'characters': ['7.2', '9.6', '4.7', '3.8', '6.9', '9.4']},
     {'cat': 'animals',
      'subcat': 'dog',
      'characters': ['4.2', '5.6', '6.7', '1.8', '3.9', '8.4']},
     {'cat': 'non-living',
      'subcat': 'rock',
      'characters': ['1.2', '1.6', '4.7', '6.8', '1.9', '0.4']},
     {'cat': 'non-living',
      'subcat': 'stars',
      'characters': ['3.2', '5.6', '2.7', '4.8', '1.9', '2.4']}]
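
If you would rather have the values grouped by category and sub-category, and as numbers instead of strings, one possible variation is to walk the siblings instead of using find_previous / find_next. This is only a sketch, assuming every category span is directly followed by its Box2 div as in the question's HTML; sample_html and nested are hypothetical names, and the markup is a shortened stand-in for html_doc:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the question's html_doc
sample_html = """
<div class="Box1">
  <span class="catagory">Plant</span>
  <div class="Box2">
    <span class="sub-catagory">Trees</span>
    <div class="characters">
      <div class="font-medium">1.2</div>
      <div class="font-medium">1.6</div>
    </div>
    <span class="sub-catagory">Flowers</span>
    <div class="characters">
      <div class="font-medium">2.2</div>
    </div>
  </div>
  <span class="catagory">animals</span>
  <div class="Box2">
    <span class="sub-catagory">dog</span>
    <div class="characters">
      <div class="font-medium">4.2</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

nested = {}
for cat in soup.select('span.catagory'):
    # Assumption: the Box2 div right after a category span holds its sub-categories
    box = cat.find_next_sibling('div', class_='Box2')
    subcats = {}
    for sub in box.select('span.sub-catagory'):
        # The "characters" div following a sub-category span holds its values
        values = sub.find_next_sibling('div', class_='characters')
        subcats[sub.get_text(strip=True)] = [
            float(v.get_text()) for v in values.select('div.font-medium')
        ]
    nested[cat.get_text(strip=True)] = subcats

print(nested)
```

Restricting the lookup to siblings keeps each sub-category tied to its own Box2, which may be a bit safer than find_next('div') if the markup ever gains extra div elements in between.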