Issues defining cron jobs in a Procfile (Heroku) using APScheduler for a Django project


I am having a problem scheduling a cron job that scrapes a website and stores the results in a model (Movie) in the database.

The problem is that the model seems to get loaded before the Procfile is executed.
How should I create a cron job that runs in the background and stores the scraped information in the database? Here is my code:

Procfile:

    web: python manage.py runserver 0.0.0.0:$PORT
    scheduler: python cinemas/scheduler.py

scheduler.py:

# More code above
from cinemas.models import Movie
from apscheduler.schedulers.blocking import BlockingScheduler
sched = BlockingScheduler()

# The cron trigger's keyword is `minute` (singular), not `minutes`.
@sched.scheduled_job('cron', day_of_week='mon-fri', hour=0, minute=26)
def get_movies_playing_now():
  global url_movies_playing_now
  Movie.objects.all().delete()
  while(url_movies_playing_now):
    title = []
    description = []
    #Create a BeautifulSoup object from the page at the URL
    s = requests.get(url_movies_playing_now, headers=headers)
    soup = bs4.BeautifulSoup(s.text, "html.parser")
    movies = soup.find_all('ul', class_='w462')[0]

    #Find Movie's title
    for movie_title in movies.find_all('h3'):
        title.append(movie_title.text)
    #Find Movie's description
    for movie_description in soup.find_all('ul',
                                           class_='w462')[0].find_all('p'):
        description.append(movie_description.text.replace(" [More]","."))

    for t, d in zip(title, description):
        m = Movie(movie_title=t, movie_description=d)
        m.save()

    #Go to the next page to find more movies
    paging = soup.find( class_='pagenating').find_all('a', class_=lambda x:
                                                      x != "inactive")
    href = ""
    for p in paging:
        if "next" in p.text.lower():
            href = p['href']
    url_movies_playing_now = href

sched.start()
# More code below

cinemas/models.py:

from django.db import models

#Create your models here.

class Movie(models.Model):
    movie_title = models.CharField(max_length=200)
    movie_description = models.CharField(max_length=20200)

This is the error I get when the job runs:

2016-11-17T17:57:06.074914+00:00 app[scheduler.1]: Traceback (most recent call last):
2016-11-17T17:57:06.074931+00:00 app[scheduler.1]:   File "cinemas/scheduler.py", line 2, in <module>
2016-11-17T17:57:06.075058+00:00 app[scheduler.1]:     import cineplex
2016-11-17T17:57:06.075060+00:00 app[scheduler.1]:   File "/app/cinemas/cineplex.py", line 1, in <module>
2016-11-17T17:57:06.075173+00:00 app[scheduler.1]:     from cinemas.models import Movie
2016-11-17T17:57:06.075196+00:00 app[scheduler.1]:   File "/app/cinemas/models.py", line 5, in <module>
2016-11-17T17:57:06.075295+00:00 app[scheduler.1]:     class Movie(models.Model):
2016-11-17T17:57:06.075297+00:00 app[scheduler.1]:   File "/app/.heroku/python/lib/python3.5/site-packages/django/db/models/base.py", line 105, in __new__
2016-11-17T17:57:06.075414+00:00 app[scheduler.1]:     app_config = apps.get_containing_app_config(module)
2016-11-17T17:57:06.075440+00:00 app[scheduler.1]:   File "/app/.heroku/python/lib/python3.5/site-packages/django/apps/registry.py", line 237, in get_containing_app_config
2016-11-17T17:57:06.075585+00:00 app[scheduler.1]:     self.check_apps_ready()
2016-11-17T17:57:06.075586+00:00 app[scheduler.1]:   File "/app/.heroku/python/lib/python3.5/site-packages/django/apps/registry.py", line 124, in check_apps_ready
2016-11-17T17:57:06.075703+00:00 app[scheduler.1]:     raise AppRegistryNotReady("Apps aren't loaded yet.")
2016-11-17T17:57:06.075726+00:00 app[scheduler.1]: django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.

The cron job works fine if I do not use model objects. How can I run this job every day with model objects without it failing?

Thanks


Solution

  • That's because you can't just import Django packages, models, etc. from a standalone script.
    To work properly, Django's internals require initialization (loading the settings and populating the app registry), which manage.py triggers before running any command.

    Rather than trying to re-create all of that myself, I always write long-running, non-web processes as custom management commands.

    For example, if your app is cinemas, you would:

    • Create ./cinemas/management/commands/scheduler.py.
    • In that file, create a subclass of django.core.management.base.BaseCommand (the subclass must be named Command).
    • In that class, override handle(). In your case, that's where you'd call sched.start()
    • Your Procfile would then have scheduler: python manage.py scheduler

    Hope that helps.