Tags: python, regex, lambda, beautifulsoup, quotes

Scraping text containing certain characters and names in Python?


I'm fairly new to Python and working on a project in which I need all the quotes from certain people in a set of articles.

For this question I use this article as an example: https://www.theguardian.com/us-news/2021/oct/17/jeffrey-clark-scrutiny-trump-election-subversion-scheme

Right now, using a lambda function, I am able to scrape text containing the names of the people I am looking for with the following code:

import requests
from bs4 import BeautifulSoup
url = 'https://www.theguardian.com/us-news/2021/oct/17/jeffrey-clark-scrutiny-trump-election-subversion-scheme'
response = requests.get(url)
data=response.text
soup=BeautifulSoup(data,'html.parser')
tags=soup.find_all('p')
words = ["Michael Bromwich"]
for tag in tags:
    quotes=soup.find("p",{"class":"dcr-s23rjr"}, text=lambda text: text and any(x in text for x in words)).text

print(quotes)

... which returns the block of text containing "Michael Bromwich", which in this case actually is a quote in the article. But when scraping 100+ articles, this does not do the job, as other blocks of text may also contain the indicated names without containing a quote. I only want the strings of text containing the quotes.

Therefore, my question: Is it possible to print all HTML strings under the following criteria:

Text BEGINS with the character " (quotation mark) OR - (hyphen) AND CONTAINS one of the names "Michael Bromwich" OR "John Johnson" etc.

Thank you!


Solution

  • First of all, you do not need the for tag in tags loop: it runs the same soup.find call on every iteration and keeps only the first match. Use soup.find_all with your condition instead.

    Next, you can check for the quotation marks or hyphen without any regex:

    quotes = [x.text for x in soup.find_all("p", {"class": "dcr-s23rjr"}, text=lambda t: t and (t.startswith("“") or t.startswith('"') or t.startswith("-")) and any(x in t for x in words))]
    

    The (t.startswith("“") or t.startswith('"') or t.startswith("-")) part checks whether the text starts with “, " or -.

    Or,

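    quotes = [x.text for x in soup.find_all("p", {"class": "dcr-s23rjr"}, text=lambda t: t and t.strip() and t.strip()[0] in '“"-' and any(x in t for x in words))]
    

    The t.strip()[0] in '“"-' part checks whether the first character of the stripped text is one of “, " or -. The extra t.strip() guard skips whitespace-only text, where indexing [0] would raise an IndexError.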
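
The same check can also be pulled out of the lambda into a small function, which is easier to test across 100+ articles. The following is only a minimal sketch: the helper names (is_named_quote, filter_quotes) and the sample paragraphs are made up for illustration, and it works on plain strings so it runs without downloading anything.

```python
# Hypothetical helper names; only the startswith/any logic comes from the answer above.
QUOTE_PREFIXES = ("“", '"', "-")  # opening curly quote, straight quote, hyphen


def is_named_quote(text, names):
    """Return True if text starts with a quote mark or hyphen and mentions any name."""
    stripped = text.strip()
    # str.startswith accepts a tuple of prefixes and is False for an empty string.
    return stripped.startswith(QUOTE_PREFIXES) and any(name in stripped for name in names)


def filter_quotes(paragraph_texts, names):
    """Keep only the paragraph texts that look like quotes involving the given names."""
    return [t.strip() for t in paragraph_texts if is_named_quote(t, names)]


# Made-up sample paragraphs standing in for scraped article text.
sample_paragraphs = [
    "“This is a quote,” said Michael Bromwich.",
    "Michael Bromwich was mentioned here, but nothing is quoted.",
    "  - John Johnson replied at length.",
    "\"Someone else said this,\" a spokesperson noted.",
    "   ",
]
words = ["Michael Bromwich", "John Johnson"]

quotes = filter_quotes(sample_paragraphs, words)
for quote in quotes:
    print(quote)
```

With BeautifulSoup, the input list could be built as [p.get_text() for p in soup.find_all("p", {"class": "dcr-s23rjr"})]. Using get_text() also covers paragraphs that contain nested tags such as links: for those, the tag's .string is None, so a text= filter on the tag would not match them.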