Tags: python, datetime, web-scraping, data-extraction

Extract date from multiple webpages with Python


I want to extract the date when a news article was published on a website. For some websites I have the exact HTML element where the date/time lives (div, p, time), but for some websites I do not:

These are the links for some websites (german websites):

(3 Nov 2020) http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226

(Dec. 1, 2020) http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=&date_to=

(10/22/2020) http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905

I have tried 3 different solutions with Python libraries such as requests, htmldate and date_guesser, but I always get None, or, in the case of the htmldate lib, I always get the same date (2020.1.1).

from bs4 import BeautifulSoup
import requests
from htmldate import find_date
from date_guesser import guess_date, Accuracy

# Lib find_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
my_date = find_date(response.content, extensive_search=True)
print(my_date, '\n')


# Lib guess_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
my_date = guess_date(url=url, html=requests.get(url).text)
print(my_date.date, '\n')


# Lib Requests  # I do NOT get a Last-Modified header
my_date = requests.head('http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226')
print(my_date.headers, '\n')

Am I doing something wrong?

Can you please tell me whether there is a way to extract the date of publication from websites like these (where I do not have specific div, p, or datetime elements)?

IMPORTANT! I want to make the date extraction universal, so that I can put these links in a for loop and run the same function on all of them.


Solution

  • I have never had much success with some of the date parsing libraries, so I usually go another route. I believe that the best method to extract the date strings from the sites in your question is with regular expressions.

    website: linden.ch

    import requests
    import re as regex
    from bs4 import BeautifulSoup
    from datetime import datetime
    
    url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    page_body = soup.find('body')
    find_date = regex.search(r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})', str(page_body))
    reformatted_timestamp = datetime.strptime(find_date.groups()[1], '%d. %b. %Y').strftime('%d-%m-%Y')
    print(reformatted_timestamp)
    # print output 
    03-11-2020
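
    Note that `%b` in `strptime` is locale-dependent: the snippet above works because the German abbreviation "Nov." coincides with English "Nov", but months such as "Dez." (December) or "Mär." (March) would raise a ValueError under the default C/English locale. A minimal locale-independent sketch (the month map and helper name are my own additions, not part of the answer):

    ```python
    from datetime import datetime

    # German month abbreviations mapped to month numbers
    # (hypothetical helper, not part of the original answer)
    GERMAN_MONTHS = {
        'Jan': 1, 'Feb': 2, 'Mär': 3, 'Apr': 4, 'Mai': 5, 'Jun': 6,
        'Jul': 7, 'Aug': 8, 'Sep': 9, 'Okt': 10, 'Nov': 11, 'Dez': 12,
    }

    def parse_german_date(raw):
        """Parse a string like '3. Nov. 2020' without relying on the locale."""
        day, month, year = raw.replace('.', '').split()
        return datetime(int(year), GERMAN_MONTHS[month[:3]], int(day))

    print(parse_german_date('3. Nov. 2020').strftime('%d-%m-%Y'))  # 03-11-2020
    ```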
    

    website: buchholterberg.ch

    import requests
    import re as regex
    from bs4 import BeautifulSoup
    from datetime import datetime
    
    url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    page_body = soup.find('body')
    find_date = regex.search(r'(Veröffentlicht)\s\w+:\s(\d{1,2}:\d{1,2}:\d{1,2})\s(\d{1,2}\.\d{1,2}\.\d{4})', str(page_body))
    reformatted_timestamp = datetime.strptime(find_date.groups()[2], '%d.%m.%Y').strftime('%d-%m-%Y')
    print(reformatted_timestamp)
    # print output
    22-10-2020
    

    Update 12-04-2020

    I looked at the source code for the two Python libraries you mentioned, htmldate and date_guesser. Neither of these libraries can currently extract the date from the 3 sources listed in your question. The primary reason for this is the date formats and language (German) of these target sites.

    I had some free time, so I put this together for you. The answer below can easily be modified to extract from any website and can be refined as needed based on the format of your target sources. It currently extracts the date from every link contained in urls.


    all urls

    import requests
    import re as regex
    from bs4 import BeautifulSoup
    
    def extract_date(can_of_soup):
        page_body = can_of_soup.find('body')
        clean_body = str(page_body).replace('\n', '')
        if 'Datum der Neuigkeit' in clean_body or 'Veröffentlicht' in clean_body:
            date_formats = r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})|(Veröffentlicht am: \d{2}:\d{2}:\d{2} )(\d{1,2}\.\d{1,2}\.\d{4})'
            find_date = regex.search(date_formats, clean_body, regex.IGNORECASE)
            if find_date:
                # drop the unmatched alternation groups, keep label + date
                clean_tuples = [i for i in find_date.groups() if i]
                return clean_tuples[1]
        else:
            # fall back to known date-bearing div classes on the other sites
            tags = ['extra', 'elementStandard elementText', 'icms-block icms-information-date icms-text-gemeinde-color']
            for tag in tags:
                date_tag = page_body.find('div', {'class': tag})
                if date_tag is not None:
                    children = date_tag.findChildren()
                    if children:
                        find_date = regex.search(r'(\d{1,2}\.\d{1,2}\.\d{4})', str(children))
                        return ''.join(find_date.groups())
                    else:
                        return ''.join(date_tag.contents)
    
    
    def get_soup(target_url):
        response = requests.get(target_url)
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    
    
    urls = {'http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226',
        'http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0'
        '&sq=&kategorie_id=&date_from=&date_to=',
        'http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905',
        'https://www.steffisburg.ch/de/aktuelles/meldungen/Hochwasserschutz-und-Laengsvernetzung-Zulg.php',
        'https://www.wallisellen.ch/aktuellesinformationen/924227',
        'http://www.winkel.ch/de/aktuellesre/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id'
        '=1093910&ls=0&sq=&kategorie_id=&date_from=&date_to=',
        'https://www.aeschi.ch/de/aktuelles/mitteilungen/artikel/?tx_news_pi1%5Bnews%5D=87&tx_news_pi1%5Bcontroller%5D=News&tx_news_pi1%5Baction%5D=detail&cHash=ab4d329e2f1529d6e3343094b416baed'}
    
    
    for url in urls:
        html = get_soup(url)
        article_date = extract_date(html)
        print(article_date)
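
    If a looped site matches none of the patterns, extract_date returns None. As a generic last resort, you could scan for the first numeric DD.MM.YYYY date anywhere in the page text (extract_first_date is an illustrative helper, not part of the original answer):

    ```python
    import re

    def extract_first_date(html_text):
        """Return the first DD.MM.YYYY-style date in the text, or None.

        Illustrative fallback, not part of the original answer.
        """
        match = re.search(r'\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b', html_text)
        if not match:
            return None
        day, month, year = match.groups()
        return f'{int(day):02d}-{int(month):02d}-{year}'

    print(extract_first_date('Veröffentlicht am: 10:22:07 22.10.2020'))  # 22-10-2020
    print(extract_first_date('no date here'))                            # None
    ```

    Such a broad scan can false-positive on other dotted numbers on the page, so the pattern-specific regexes above are preferable whenever the site's label text is known.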