django, hdfs, livy, django-cron

I need help using django_cron


I am currently working with HDFS, Apache Livy and Django. The goal is to send a request that runs code stored in HDFS and calls Livy to create batches. For now, everything works: I have a basic wordcount stored in HDFS along with a .txt file, and on an HTML page I have a simple button that launches the whole process.

I succeeded in creating the wordcount result. My next step is to get information from Livy, for instance the IDs of the sessions (or batches) that are currently starting/running/dead/success, as some sort of callback, but I need it to refresh itself so that I always know what state each session is in. To do so, I thought I could use django-cron, but I can't manage to set it up correctly. I get no errors, but nothing happens. What am I missing?

I am working on CentOS 7 in a Conda environment with Python 3.6, using the latest releases of Django, Livy and HDFS.

Here are my current files:

livy.html

{% load static %}

<html>
<body>
<div id="div1">

{{result.sessions}}

</div>

<form action="#" method="get">
 <input type="text" name="mytextbox" />
 <input type="submit" class="btn" value="Click" name="mybtn">
</form>

</body>
</html>

views.py

from django.shortcuts import render
from django.http import HttpResponse
from django_cron import CronJobBase, Schedule
import wordcount, livy

# Create your views here.

class CheckIdCronJob(CronJobBase):
    RUN_EVERY_MINS = 1 # every minute

    schedule = Schedule(run_every_mins=RUN_EVERY_MINS)
    code = 'button.CheckIdCronJob'    # a unique code

    def index(request):
        if(request.GET.get('mybtn')):
            r = livy.send(request.GET.get('mytextbox')) #(/test/LICENSE.txt)
            return render(request,'button/livy.html', {'result':r})
        return render(request,'button/livy.html')

livy.py

import json, pprint, requests, textwrap

def send(inputText):
    host = 'http://localhost:8998'
    data = {"file":"/myapp/wordcount.py", "args":[inputText,"2"]}
    headers = {'Content-Type': 'application/json'}
    r = requests.post(host + '/batches', data=json.dumps(data), headers=headers)
    r = requests.get(host + '/batches' + '', data=json.dumps(data), headers=headers)
    return r.json()

Solution

  • What django-cron does is just make it easy to write jobs and specify how often/when these jobs should run. You end up with one management command, ./manage.py runcrons, that will check all your jobs and run them if needed.

    What it doesn't do is invoke runcrons continuously, which is what you actually need if you want to make sure your jobs run at the right moment. Basically, you want runcrons to be invoked every minute (or, if the timing is not that critical, every 10 minutes, for example), so you still need a system daemon that will do that.
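To illustrate why the invocation frequency matters, here is a minimal sketch of the "run them if needed" check. It is not django-cron's actual implementation; `is_due` and `simulate` are hypothetical names, and the logic only assumes that a job runs when at least `run_every_mins` minutes have passed since its last run.

```python
from datetime import datetime, timedelta


def is_due(last_run, run_every_mins, now):
    """Return True if a job that last ran at `last_run` should run again at `now`."""
    if last_run is None:  # the job has never run
        return True
    return now - last_run >= timedelta(minutes=run_every_mins)


def simulate(run_every_mins, trigger_every_mins, total_mins):
    """Count job runs when the scheduler is invoked every `trigger_every_mins` minutes."""
    start = datetime(2020, 1, 1, 0, 0)
    last_run = None
    runs = 0
    for minute in range(0, total_mins, trigger_every_mins):
        now = start + timedelta(minutes=minute)
        if is_due(last_run, run_every_mins, now):
            last_run = now
            runs += 1
    return runs


# A job declared with RUN_EVERY_MINS = 1, checked every 5 minutes vs. every minute.
print(simulate(run_every_mins=1, trigger_every_mins=5, total_mins=60))
print(simulate(run_every_mins=1, trigger_every_mins=1, total_mins=60))
```

Under these assumptions, a job declared to run every minute can never run more often than the scheduler is invoked, so the interval of the external trigger is the effective lower bound.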

    crontab is available on CentOS and can invoke runcrons on a schedule. The django-cron installation instructions show an example crontab entry that runs runcrons every 5 minutes:

    crontab -e
    */5 * * * * source /home/ubuntu/.bashrc && source /home/ubuntu/work/your-project/bin/activate && python /home/ubuntu/work/your-project/src/manage.py runcrons > /home/ubuntu/cronjob.log
    

    You have to adapt that to fit your use case:

    • If you just do crontab -e ... the job will run as the user you're currently logged in as. That might not be the right user to run the manage.py command, since that user needs to have the correct permissions to run your project. Use -u user to make the crontab for a different user.

      This is actually the complicated thing when running in production: Getting user permissions correct and getting the right user to run the various tasks. Normally you'd have a www-data or apache user that's running your server (and hence django app) and you want that same user to run the manage.py command. It should not be root running apache as that opens up security risks (your web server would have full access to the entire system).

    • The above command sources .bashrc to make sure the environment variables are set correctly. /home/ubuntu/ is just the user home directory for the user ubuntu. Change this appropriately.
    • The above command also activates the virtualenv so that the manage.py command can run with all the correct dependencies. Adapt the path to your virtualenv (or, since you are using Conda, activate your Conda environment instead).
    • Finally you need to make sure the correct Django settings are activated, either by having DJANGO_SETTINGS_MODULE environment variable set (which you can do in .bashrc hence the source earlier) or by passing the --settings path.to.settings option to manage.py.
    • The last part is directing the output of the task to a log file, so you can troubleshoot if there are issues. Please also add 2>&1 at the end so that cron errors (stderr) are also directed to that same log.

    To check your crontab, run crontab -l (for the currently logged in user) or crontab -l -u user for a different user.
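Separately from the scheduling, note that django-cron runs a job by calling its do() method, and the class has to be listed in CRON_CLASSES in the settings. The class in the question defines index(request) instead, so there is nothing for runcrons to execute. Below is a minimal sketch of what the job could call. It assumes that GET /batches on Livy returns a JSON object with a "sessions" list whose items carry "id" and "state"; fetch_batches, batch_states and CheckBatchesCronJob are hypothetical names, and the django-cron wiring is left in comments so the helpers can be tried without Django installed.

```python
import json
from urllib.request import urlopen

LIVY_HOST = 'http://localhost:8998'  # same host as in livy.py from the question


def fetch_batches(host=LIVY_HOST):
    """GET /batches from Livy and return the decoded JSON body."""
    with urlopen(host + '/batches') as response:
        return json.loads(response.read().decode('utf-8'))


def batch_states(payload):
    """Map each batch id to its state, given the decoded /batches response."""
    return {batch['id']: batch['state'] for batch in payload.get('sessions', [])}


# Hypothetical wiring in the Django app (requires django-cron):
#
# from django_cron import CronJobBase, Schedule
#
# class CheckBatchesCronJob(CronJobBase):
#     schedule = Schedule(run_every_mins=1)
#     code = 'button.CheckBatchesCronJob'  # a unique code
#
#     def do(self):
#         # Called by ./manage.py runcrons when the job is due.
#         print(batch_states(fetch_batches()))

# Offline example with a hand-written payload (assumed shape, not real Livy output).
sample = {'from': 0, 'total': 2,
          'sessions': [{'id': 0, 'state': 'running'},
                       {'id': 1, 'state': 'success'}]}
print(batch_states(sample))
```

Printing only sends the states to the cron log. To show them on the page, the job would have to store them somewhere the view can read, for example a model or the cache.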