Tags: python, pyspark, cron

Running PySpark using Cronjob (crontab) not working


I'm trying to submit my PySpark code through a cron job. When I run it manually, it works fine, but through cron it does not.

Here is the project structure I have:

my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py

The main code lives in src/jobs/execute_metrics.py, which uses get_spark_session.py via from src.utils import get_spark_session.

I created a shell script execute_metric.sh with the following content to run from the cron job:

#!/bin/bash
PATH=<included entire path here>
spark-submit <included required options> src/jobs/execute_metrics.py

The project structure is now:

my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
|--execute_metric.sh

When I run this shell script using ./execute_metric.sh, I'm able to see the results.

Now, I need the job to run every minute. So I created a cron file with the following content and placed it in the same directory:

* * * * * ./execute_metric.sh > execute_metric_log.log

my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
|--execute_metric.sh
|--execute_cron.crontab

The cron job runs every minute, but it fails with the error: ModuleNotFoundError: No module named 'src'
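A likely explanation (not stated in the question) is that cron starts jobs from the invoking user's home directory, so anything resolved relative to the current working directory, including an import root like src, breaks under cron even though it works when launched from the project directory. A minimal plain-Python illustration of the effect, standing in for spark-submit:

```python
import os
import subprocess
import sys
import tempfile

# Stand-in project root containing a "src" package
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "src"))
open(os.path.join(root, "src", "__init__.py"), "w").close()

# "python -c" puts the current working directory on sys.path, so whether
# "src" is importable depends entirely on where we run from.
cmd = [sys.executable, "-c", "import src; print('found src')"]

# Run from the project root -- the manual case -- and it succeeds:
ok = subprocess.run(cmd, cwd=root, capture_output=True, text=True)
print(ok.stdout.strip())                     # found src

# Run from an unrelated directory -- mimicking cron's start directory --
# and the same command fails with ModuleNotFoundError:
elsewhere = tempfile.mkdtemp()
bad = subprocess.run(cmd, cwd=elsewhere, capture_output=True, text=True)
print("ModuleNotFoundError" in bad.stderr)   # True
```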

Can someone please tell me what went wrong here?

Thanks in advance


Solution

  • I fixed it by adding a main.py file in the project root and changing the job to execute main.py instead. The project structure now looks like:

    my-project
    |
    |--src
    |----jobs
    |------execute_metrics.py
    |----utils
    |------get_spark_session.py
    |--execute_metric.sh
    |--execute_cron.crontab
    |--main.py
    

    In main.py, I'm invoking the functions of execute_metrics.py.
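This works because Python prepends the directory of the script being executed to sys.path; with main.py sitting at the project root, src is importable no matter which directory cron launches from. A self-contained sketch of the effect (the run() entry point and the printed message are hypothetical, since the question doesn't show the real function names):

```python
import os
import subprocess
import sys
import tempfile

# Build a throwaway copy of the answer's layout: my-project/src/jobs/...
root = tempfile.mkdtemp()                          # stand-in for my-project/
os.makedirs(os.path.join(root, "src", "jobs"))
for pkg in ("src", os.path.join("src", "jobs")):
    open(os.path.join(root, pkg, "__init__.py"), "w").close()
with open(os.path.join(root, "src", "jobs", "execute_metrics.py"), "w") as f:
    f.write("def run():\n    print('metrics executed')\n")

# main.py at the project root, importing through the "src" package
with open(os.path.join(root, "main.py"), "w") as f:
    f.write("from src.jobs import execute_metrics\nexecute_metrics.run()\n")

# Launch it from an unrelated directory, as cron would: the script's own
# directory (the project root) lands on sys.path, so the import succeeds.
elsewhere = tempfile.mkdtemp()
result = subprocess.run([sys.executable, os.path.join(root, "main.py")],
                        cwd=elsewhere, capture_output=True, text=True)
print(result.stdout.strip())                       # metrics executed
```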