
How can I reload the URL data within a Python script, or re-run the script hourly?


I am scraping data from a URL where the content often changes and serving up a page in Flask. What is the best strategy to re-scrape the data and send it to Flask every hour? Note: running this in a virtual env in Windows cmd.

  • Should I use APScheduler?
  • Windows Task Scheduler? (And if so, how would I kill the currently running script?)
  • Or is there some way to periodically reload and update the data within the script? (And if so, can you please show a specific implementation in the script below? I am struggling to learn Python.)

I tried some examples with APScheduler, but had no luck. My code is below:

# News feed test for Xibo Signage
from datetime import datetime

import requests
from bs4 import BeautifulSoup
from flask import Flask, render_template

app = Flask(__name__)

# datetime object containing current date and time
# dd/mm/YY H:M:S
now = datetime.now()
current_time = now.strftime("%d/%m/%Y %H:%M:%S")

url = "https://news.clemson.edu/tag/extension/"
response = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')

# Collect the data-src attribute of each article image
pictures = []
for e in soup.select('article img.lazyload'):
    pictures.append(e.get('data-src'))

# Collect the headline text of each article
titles = []
for e in soup.select('article header'):
    titles.append(e.find("h3", class_="entry-title bold").text)

@app.route('/') 
def home():
    return render_template('home.html',pictures=pictures, titles=titles, current_time=current_time)

if __name__ == '__main__':
    app.run(host='0.0.0.0', debug=True)

I cannot figure out how to properly put the functions into the scheduler job. I got the script to run as follows, but it is not updating the time/date on the Flask page.

from apscheduler.schedulers.background import BackgroundScheduler

now = datetime.now()
current_time = now.strftime("%d/%m/%Y %H:%M:%S")

def sensor():
    """ Function for test purposes. """
    now = datetime.now()
    current_time = now.strftime("%d/%m/%Y %H:%M:%S")
    print("Scheduler is alive!")

sched = BackgroundScheduler(daemon=True)
sched.add_job(sensor,'interval',minutes=5)
sched.start()

It only sets the datetime once when I initially run the script.


Solution

  • Normally, I'd run the job as a one-off script with Windows Task Scheduler, cron, or a GitHub Action and write the data to a file. Then, I'd have the Flask app read the file and serve the data when requested.
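
As a rough sketch of that file-based arrangement (the helper names `save_snapshot` and `load_snapshot` and the JSON layout are my own assumptions, not part of any library): the scheduled scraper script would call `save_snapshot` after scraping, and the Flask view would call `load_snapshot` on each request.

```python
import json
import os
import tempfile
from datetime import datetime


def save_snapshot(path, pictures, titles):
    """Called by the scheduled scraper: write the data to `path`.

    The data goes to a temporary file first and is then swapped into place,
    so a reader should not see a half-written file.
    """
    payload = {
        "pictures": pictures,
        "titles": titles,
        "scraped_at": datetime.now().strftime("%d/%m/%Y %H:%M:%S"),
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)


def load_snapshot(path):
    """Called by the Flask view: read the latest data.

    Falls back to empty lists if the scraper has not run yet.
    """
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"pictures": [], "titles": [], "scraped_at": None}
```

With this split there is no running script to kill: the scraper starts, writes the file, and exits each hour, while the Flask app keeps running. One caveat I'd expect on Windows is that `os.replace` can fail if another process has the destination file open at that moment, so a retry may be worth adding.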

    But if you want the task to run within the Flask app, optionally without file persistence, you can arrange it as follows using Flask-APScheduler:

    # python 3.10.12
    from datetime import datetime
    from flask import Flask # 3.0.2
    from flask_apscheduler import APScheduler # 1.13.1
    
    
    class Config:
        SCHEDULER_API_ENABLED = True
    
    
    class Time:
        now = datetime.now()
    
    
    app = Flask(__name__)
    app.config.from_object(Config())
    scheduler = APScheduler()
    scheduler.init_app(app)
    scheduler.start()
    
    
    @scheduler.task("interval", id="sensor", seconds=5)
    def sensor():
        Time.now = datetime.now()
        print("job ran:", Time.now)
    
    
    @app.get("/")
    def index():
        return {"last_job_time": Time.now.strftime("%d/%m/%Y %H:%M:%S")}
    
    
    if __name__ == "__main__":
        app.run()
    

    Sample output after starting the server:

    $ curl localhost:5000
    {"last_job_time":"13/03/2024 12:10:11"}
    $ curl localhost:5000
    {"last_job_time":"13/03/2024 12:10:11"}
    $ curl localhost:5000
    {"last_job_time":"13/03/2024 12:10:16"}
    $ curl localhost:5000
    {"last_job_time":"13/03/2024 12:10:16"}
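
The reason the `sensor` function in the question has no visible effect is that `current_time` is assigned inside the function, which creates a local variable and leaves the module-level one untouched. The `Time` class above avoids this by mutating an attribute on a shared object. The same idea can be stretched to the scraped data. The following is only a sketch: `NewsCache` and `scrape_news` are hypothetical names, and the wiring to Flask-APScheduler is shown in comments so the block runs without any third-party packages.

```python
import threading
from datetime import datetime


class NewsCache:
    """Holds the most recent scrape result, shared by the job and the view."""

    def __init__(self):
        self._lock = threading.Lock()
        self.pictures = []
        self.titles = []
        self.updated = None

    def refresh(self, scrape):
        """Call scrape(), expected to return (pictures, titles), and store the result."""
        pictures, titles = scrape()  # do the slow network work outside the lock
        with self._lock:
            self.pictures = pictures
            self.titles = titles
            self.updated = datetime.now()

    def snapshot(self):
        """Return a consistent copy of the data for the template."""
        with self._lock:
            if self.updated is None:
                stamp = "never"
            else:
                stamp = self.updated.strftime("%d/%m/%Y %H:%M:%S")
            return {
                "pictures": list(self.pictures),
                "titles": list(self.titles),
                "current_time": stamp,
            }


cache = NewsCache()

# Assumed wiring, reusing `scheduler` and `app` from the example above.
# scrape_news is hypothetical: the requests/BeautifulSoup code from the
# question wrapped in a function that returns (pictures, titles).
#
# @scheduler.task("interval", id="scrape", hours=1)
# def scrape_job():
#     cache.refresh(scrape_news)
#
# @app.route("/")
# def home():
#     return render_template("home.html", **cache.snapshot())
```

You would probably also want to call `cache.refresh(scrape_news)` once at startup so the first request is not empty. Also be aware that running Flask with `debug=True` enables the reloader, which may load the module twice and so start the job twice.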