newspaper3k: find the author name in visible text after the first "by"


Newspaper3k is a good Python library for news content extraction, and it mostly works well. I want to extract the name that follows the first "by" in the visible text. This is my code; it does not work well. Can somebody please help?

import re
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
USER_AGENT = 'Mozilla/5.0 (Macintosh;Intel Mac OS X 10.15; rv:78.0)Gecko/20100101   Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10 
html1='https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2/'
article = Article(html1.strip(), config=config)
article.download()
article.parse()
soup = BeautifulSoup(article)
## I want to take only visible text
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()
for line in visible_text:
    # Capture one-or-more words after first (By or by) the initial match
    match = re.search(r'By (\S+)', line)

    # Did we find a match?
    if match:
        # Yes, process it to print 
        By = match.group(1)
        print('By {}'.format(By))

Solution

  • This is not a comprehensive answer, but it is one that you can build on. You will need to expand this code as you add more sources. As I stated before, my Newspaper3k overview document has lots of extraction examples, so please review it thoroughly.

    Regular expressions should be a last-ditch effort, used only after trying these extraction methods with newspaper3k:

    • article.authors
    • meta tags
    • json
    • soup

    from newspaper import Config
    from newspaper import Article
    from bs4 import BeautifulSoup
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    urls = ['https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2',
            'https://www.macleans.ca/education/what-college-students-in-canada-can-expect-during-covid',
            'https://www.cnn.com/2021/02/12/asia/india-glacier-raini-village-chipko-intl-hnk/index.html',
            'https://www.latimes.com/california/story/2021-02-13/wildfire-santa-cruz-boulder-creek-residents-fear-water'
            '-quality',
            'https://foxbaltimore.com/news/local/maryland-lawmakers-move-ahead-with-first-tax-on-internet-ads-02-13-2021']
    
    for url in urls:
        try:
            article = Article(url, config=config)
            article.download()
            article.parse()
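            authors = article.authors
            if authors:
                print(authors)
            else:
                # newspaper3k found nothing, so fall back to the byline markup.
                soup = BeautifulSoup(article.html, 'html.parser')
                container = soup.find(True, {'class': ['td-post-author-name', 'byline']})
                author_tag = container.find(['a', 'span']) if container else None
                if author_tag:
                    name = author_tag.get_text().strip()
                    # Drop a leading "By" without touching names such as "Byron".
                    if name.lower().startswith('by '):
                        name = name[3:].strip()
                    print(name)
                else:
                    print('no author found')
        except Exception as e:
            # Report the failure instead of silently skipping the URL.
            print(f'could not process {url}: {e}')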
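
The code above covers article.authors and soup. As a rough sketch of the other two items in the list (meta tags and json), plus the regex fallback the question asked about, something like the following could work. Everything here is an assumption for illustration: the function names are made up, SAMPLE_HTML is an invented page rather than markup from any of the sites above, and real pages will need their own tag names and patterns. In practice you would pass article.html instead of the sample string.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical sample page; real sites use different markup.
SAMPLE_HTML = """
<html><head>
<title>Food, glorious food</title>
<meta name="author" content="Jane Doe">
<script type="application/ld+json">
{"@type": "NewsArticle", "author": {"@type": "Person", "name": "Jane Doe"}}
</script>
<style>p {color: black;}</style>
</head><body>
<p>New Perspectives</p>
<p>By John Smith, staff writer</p>
<p>Stand by for more.</p>
</body></html>
"""


def author_from_meta(html):
    """Return the content of <meta name="author">, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('meta', attrs={'name': 'author'})
    if tag and tag.get('content'):
        return tag['content'].strip()
    return None


def author_from_json_ld(html):
    """Return the first author name found in a JSON-LD script, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            author = item.get('author')
            if isinstance(author, list) and author:
                author = author[0]
            if isinstance(author, dict):
                author = author.get('name')
            if isinstance(author, str) and author.strip():
                return author.strip()
    return None


def author_from_visible_text(html):
    """Last resort: capitalized words that follow the first 'By'/'by'."""
    soup = BeautifulSoup(html, 'html.parser')
    # Remove elements whose text is never shown to the reader.
    for tag in soup(['script', 'style', 'head', 'title']):
        tag.extract()
    text = soup.get_text(separator='\n')
    # Only spaces and tabs may separate the words, so the match stays on one line.
    pattern = r"\b[Bb]y[ \t]+([A-Z][\w'.-]*(?:[ \t]+[A-Z][\w'.-]*)*)"
    match = re.search(pattern, text)
    return match.group(1) if match else None


def find_author(html):
    """Try the structured sources first and the regex last."""
    return (author_from_meta(html)
            or author_from_json_ld(html)
            or author_from_visible_text(html))


if __name__ == '__main__':
    print(find_author(SAMPLE_HTML))
```

Note that the regex works on the whole text at once. Looping over the string returned by getText(), as in the question, visits one character at a time, so a pattern like 'By (\S+)' can never match there. The pattern is also deliberately strict (capitalized words on a single line), so it will miss bylines split across tags and may pick up phrases such as "by Monday"; that is why it belongs at the end of the chain.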