python web-scraping beautifulsoup blockquote

BeautifulSoup - Getting all the children from a tag instead of only the first


I am writing a script that collects data from a website, but I am having trouble extracting only specific information. The HTML that is causing me problems is the following:

<div class="Content">
  <article>
    <blockquote class="messageText 1234">
      I WANT THIS
      <br/>
      I WANT THIS 2
      <br/>
      </a>
      <br/>
    </blockquote>
  </article>
</div>
<div class="Content">
  <article>
    <blockquote class="messageText 1234">
      <a class="IDENTIFIER" href="WEBSITE">

      </a>
      NO WANT THIS
      <br/>
      <br/>
      NO WANT THIS
      <br/>
      <br/>
      NO WANT THIS
      <div class="messageTextEndMarker">
      </div>
    </blockquote>
  </article>
</div>

I am trying to print only the "I WANT THIS" part. I have the following script:

import requests
from bs4 import BeautifulSoup

url = ''
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')

for a in soup.find_all('div', class_='panels'):
    for b in a.find_all('form', class_='section'):
        for c in b.find_all('div', class_='message'):
            for d in c.find_all('div', class_='primaryContent'):
                for d in d.find_all('div', class_='messageContent'):
                    for e in d.content.find_all('blockquote', class_='messageText 1234')[0]:
                        print(e.string)

My idea was to extract only the text of the first blockquote element. However, I am getting the text from all of the blockquotes:

 I WANT THIS
 NO WANT THIS

NO WANT THIS

NO WANT THIS

How can I achieve this?


Solution

  • Why not use select_one to isolate the first block, then stripped_strings to separate out the text strings?

    from bs4 import BeautifulSoup as bs
    
    html = ''' your html'''
    soup = bs(html, 'lxml')
    print([s for s in soup.select_one('.Content .messageText').stripped_strings])
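
For reference, here is a minimal, self-contained sketch of the same idea run against the HTML from the question. It makes a few assumptions that are not in the original post: the stray </a> in the first blockquote is left out, the built-in 'html.parser' is used so that lxml does not have to be installed, and the names first_block and wanted are only illustrative. The likely reason the nested loops print everything is that find_all(...)[0] is evaluated once per enclosing container, so each container contributes its own first blockquote. select_one on the whole document returns only the first match overall.

```python
from bs4 import BeautifulSoup

# HTML from the question, trimmed slightly (stray </a> omitted).
html = """
<div class="Content">
  <article>
    <blockquote class="messageText 1234">
      I WANT THIS
      <br/>
      I WANT THIS 2
      <br/>
      <br/>
    </blockquote>
  </article>
</div>
<div class="Content">
  <article>
    <blockquote class="messageText 1234">
      <a class="IDENTIFIER" href="WEBSITE">
      </a>
      NO WANT THIS
      <br/>
      <br/>
      NO WANT THIS
      <div class="messageTextEndMarker">
      </div>
    </blockquote>
  </article>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first element matching the CSS selector,
# or None when nothing matches.
first_block = soup.select_one(".Content .messageText")

# stripped_strings yields each text node with surrounding whitespace
# removed and skips nodes that are whitespace only.
wanted = list(first_block.stripped_strings) if first_block is not None else []

print(wanted)  # ['I WANT THIS', 'I WANT THIS 2']
```

If only the very first line is needed, wanted[0] would give 'I WANT THIS', provided the list is not empty.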