
schedule python script with crontab


I have a bash script that I'm trying to run with a cron job on my Ubuntu server. I want it to run every day at 08:00 UTC. The bash script activates a conda Python virtual environment and runs a Python script, which pulls data and loads it into a MySQL database. I also have logging throughout the Python script. No new data showed up in the database last night, and no new logs were created. Below I've shown what's in the crontab and what's in the stocks_etl.sh script. Does anyone see what the issue might be, and how to fix it?

sudo crontab -e

crontab shows

0 8 * * * /mnt/data/sda/user_storage/stocks_etl.sh
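
Since the job seems to fail silently, a first debugging step (not shown in the crontab above) is to redirect the job's stdout and stderr to a file so any errors become visible; the log path below is illustrative:

```shell
# crontab entry: run daily at 08:00 UTC and capture all output
0 8 * * * /mnt/data/sda/user_storage/stocks_etl.sh >> /tmp/stocks_etl_cron.log 2>&1
```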

stocks_etl.sh

#!/bin/bash
source activate py36
python /mnt/data/sda/user_storage/stocks_etl.py

Update #3:

When I run this command on the command line on my Ubuntu server, it works fine:

bash ~/etl_scripts/stocks_etl.bashrc

When I run it from crontab as the same user, it throws the error below.

error:

Started stocks_etl.bash
Thu Feb 25 05:20:01 UTC 2021
/home/user/etl_scripts/stocks_etl.bashrc: line 5: activate: No such file or directory
Traceback (most recent call last):
  File "/home/user/etl_scripts/stocks_etl.py", line 4, in <module>
    import numpy as np
ImportError: No module named numpy

Here's the bashrc file:

#!/bin/bash -l
echo 'Started stocks_etl.bash'
date +'%a %b %e %H:%M:%S %Z %Y'


source activate py36
python ~/etl_scripts/stocks_etl.py

It's as if crontab can't find conda and runs the script with the base Python installation, which doesn't have numpy installed. Does anyone see what the issue might be, and can you suggest how to solve it?
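
This symptom matches how cron runs jobs: in a minimal, non-login shell whose PATH does not include conda's directories (an interactive shell picks those up from ~/.bashrc, which cron never sources). You can approximate cron's sparse environment interactively to reproduce the failure before editing the crontab; this is a rough sketch, since env -i clears the inherited environment:

```shell
# Approximate cron's bare environment: with PATH reduced to system
# directories, 'conda' can't be found and 'python' resolves to the
# system interpreter instead of the py36 env.
env -i HOME="$HOME" PATH=/usr/bin:/bin /bin/sh -c 'echo "PATH is: $PATH"'
```

Running the wrapper script under a shell like this should reproduce the same "activate: No such file or directory" error you see from cron.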

Update #2: now that I've run chmod 777 on the files, the crontab execution produces the error below. It's as if the conda virtual environment isn't being activated and the script is being run with the base Python installation.

error:

/mnt/data/sda/user_storage/etl_scripts/stocks_etl.sh: line 2: activate: No such file or directory
Traceback (most recent call last):
  File "/mnt/data/sda/user_storage/etl_scripts/stocks_etl.py", line 1, in <module>
    import numpy as np
ImportError: No module named numpy

Update:

stocks_etl.py

import datetime
import glob
import logging
import os
import time

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymysql
from sqlalchemy import create_engine
from yahoofinancials import YahooFinancials


# helper functions



# function for creating error logs

def error_logger(path):

    # adding a timestamp to logname
    ts = str(datetime.datetime.now().isoformat())

    logging.basicConfig(filename=path + ts + '.log', level=logging.DEBUG,
                        format='%(asctime)s %(levelname)s %(name)s %(message)s')

    # return the configured logger so callers can actually use it
    return logging.getLogger(__name__)


# function to query mysql db and return dataframe of results
def mysql_query(user, password, database, host, query):

    connection = pymysql.connect(user=user, password=password, database=database, host=host)

    try:
        df = pd.read_sql(query, connection)
        logging.info('query succeeded: ' + query)
        return df

    except Exception as err:
        logger.error('query failed: ' + query + ' got error: ' + str(err))

    finally:
        connection.close()
        logging.info('close connection mysql')


# function to download OHLC stock data

def download_stocks(Ticker_list, start_date, end_date, time_interval, path):

    # get data for each stock in Ticker_list and save it as a csv

    failed_list = []
    passed_list = []

    for ticker in Ticker_list:

        try:
            yahoo_financials = YahooFinancials(ticker)
            data = yahoo_financials.get_historical_price_data(start_date, end_date, time_interval=time_interval)

            prices_df = pd.DataFrame(data[ticker]['prices'])

            prices_df['date'] = prices_df['formatted_date']
            prices_df = prices_df[['date', 'adjclose', 'close', 'high', 'low', 'open', 'volume']]
            prices_df['Ticker'] = ticker

            prices_df.to_csv(path + ticker + '.csv')

            passed_list.append(ticker)
            logging.info('downloaded: ' + ticker)

            time.sleep(1)

        except Exception as err:
            failed_list.append(ticker)
            logger.error('tried download: ' + ticker + ' got error: ' + str(err))
        

# function read csv in and append to one dataframe

def stock_dataframe(path):

    try:
        all_files = glob.glob(path + "/*.csv")

        li = []

        for filename in all_files:
            df = pd.read_csv(filename, index_col=None, header=0)
            li.append(df)

        frame = pd.concat(li, axis=0, ignore_index=True)

        frame = frame[['date', 'adjclose', 'close', 'high', 'low', 'open', 'volume', 'Ticker']]

        logging.info('created stock dataframe')

        return frame

    except Exception as err:
        logger.error('stock dataframe create failed got error: ' + str(err))


# write dataframe to mysql db

def write_dataframe(username, password, host, schema, dataframe, table, if_exists, index):

    try:
        engine = create_engine("mysql+pymysql://" + str(username) + ":" + str(password) + "@" + str(host) + "/" + str(schema))
        dataframe.to_sql(con=engine, name=table, if_exists=if_exists, index=index)

        logging.info('write_dataframe succeeded')

    except Exception as err:
        logger.error('write_dataframe failed got error: ' + str(err))




# to do

# - create directory with datetime prefix as part of path
# - add step that checks max date in current table
# - only pull data later than max date in current table
# - check max date in current derived table
# - only pull data later than current date from source table


def etl_pipeline(table_var):


    i = table_var

    max_date_query = "select max(date) as max_date from " + i

    try:
        
        max_date_df=mysql_query(user='user',
                            password='psswd',
                            database='stocks',
                            host='xxxxx',
                            query=max_date_query)
            
        logging.info('max_date succeeded: '+i)
            
    except Exception as err:
        logger.error('max_date failed: ' + i)
        



    try:
        # get max date
        max_date=max_date_df.astype(str)['max_date'][0]


        # create directory

        base_path='/mnt/data/sda/user_storage/stock_data_downloads/'

        # get current_date
        current_date=datetime.datetime.today().strftime('%Y-%m-%d')

        directory_path=base_path+i+'/'+current_date

        # create directory for downloading new stocks in to
        os.mkdir(directory_path)

        logging.info('create directory succeeded: '+i)

    except Exception as err:
        logger.error('create directory failed: ' + i)




    # getting ticker symbols

    ticker_query = "select distinct ticker as ticker from " + i

    try:
        
        tickers_df=mysql_query(user='user',
                            password='psswd',
                            database='stocks',
                            host='xxxxx',
                            query=ticker_query)
            
        logging.info('get tickers succeeded: '+i)
            
    except Exception as err:
        logger.error('get tickers failed: ' + i)




    # get ticker symbols 
    stocks=tickers_df.ticker.tolist()


    # download stocks
    # Note: must add '/' to end of path
    # '2019-01-01', '2021-01-01', time_interval='daily'
    download_stocks(Ticker_list=stocks,
                    start_date=max_date,
                    end_date=current_date,
                    time_interval='daily',
                    path=directory_path+'/')




    # create dataframe
    stocks_df=stock_dataframe(path=directory_path)








    # create mysql table
    write_dataframe(username='user', 
                    password='psswd', 
                    host='xxxxx', 
                    schema='stocks',
                    dataframe=stocks_df,
                    table=i,
                    if_exists='append',
                    index=False)




    # creating additional avg annual returns

    try:
        
        query="""select ticker, avg(annual_returns) as avg_annual_returns from (
        select ticker,date, ( -1 +
                a.adjclose / max(a.adjclose) over (partition by ticker 
                                             order by date
                                             range between interval 365 day preceding and interval 365 day preceding
                                            ) 
               ) as annual_returns              
        from """+i+""" a
        ) b where annual_returns is not null
        group by ticker"""

        df=mysql_query(user='user',password='psswd',database='stocks',host='xxxxx',query=query)

        logging.info('etl succeeded: '+i+'_returns')

    except Exception as err:
        logger.error('etl failed: ' + i + '_returns')




    # adding additional avg annual returns to table

    # create mysql table
    write_dataframe(username='user', 
                    password='psswd', 
                    host='xxxxx', 
                    schema='stocks',
                    dataframe=df,
                    table=i+'_returns',
                    if_exists='replace',
                    index=False)
    
    
# start logging

# adding a timestamp to logname
ts=str(datetime.datetime.now().isoformat())  

# logging.basicConfig(filename='example.log',level=logging.DEBUG)
logging.basicConfig(filename='/mnt/data/sda/user_storage/logs/etl_scripts/'+ts+'.log', level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)s %(name)s %(message)s')

logger=logging.getLogger(__name__)


    
table_list=['trav_stocks','s_and_p','american_mutual_funds']

for j in table_list:

    try:
        etl_pipeline(j)
        logging.info('etl_pipeline succeeded: ' + j)
    except Exception as err:
        logger.error('etl_pipeline failed: ' + j)

Update:

I changed my file to a .bash file and the code inside to:

#!/bin/bash -l
echo ''
'Started stocks_etl.bash'
date +'%a %b %e %H:%M:%S %Z %Y'


source /home/user/anaconda3/envs/py36/bin/activate 
conda activate py36
python ~/etl_scripts/stocks_etl.py

Now I'm getting the error below when it runs from crontab:

error:

/home/user/etl_scripts/stocks_etl.bash: line 3: Started stocks_etl.bash: command not found
Fri Feb 26 16:28:01 UTC 2021
/home/user/etl_scripts/stocks_etl.bash: line 7: /home/user/anaconda3/envs/py36/bin/activate: No such file or directory
/home/user/etl_scripts/stocks_etl.bash: line 8: conda: command not found
Traceback (most recent call last):
  File "/home/user/etl_scripts/stocks_etl.py", line 4, in <module>
    import numpy as np
ImportError: No module named numpy

Update:

code:

#!/bin/bash
echo ''
'Started stocks_etl.bash'
date +'%a %b %e %H:%M:%S %Z %Y'


/home/user/anaconda3 run -n py36 python ~/user/etl_scripts/stocks_etl.py

error:

/home/user/etl_scripts/stocks_etl.bash: line 3: Started stocks_etl.bash: command not found
Fri Feb 26 16:43:01 UTC 2021
/home/user/etl_scripts/stocks_etl.bash: line 7: /home/user/anaconda3: Is a directory

Solution

  • First, the source activate syntax was deprecated years ago (how old is your Conda installation?) - you should be using conda activate. Second, Conda shell commands are loaded into the shell as part of sourcing .bashrc or .bash_profile. So, at the very least, you need to include the -l in the shebang and use:

    #!/bin/bash -l
    conda activate py36
    python /mnt/data/sda/user_storage/stocks_etl.py
    

    You may need to do something extra to ensure the .bashrc it sources is correct (e.g., what user does it run as?).
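
    To see what the cron job's shell actually resolves, a few diagnostic lines at the top of the wrapper script help (illustrative only; remove them once the job works):

```shell
#!/bin/bash -l
# Log what this shell can see before trying to activate anything
echo "PATH=$PATH"
command -v conda || echo "conda not on PATH"
command -v python || echo "python not on PATH"
```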

    Note that Conda also has the conda run command for executing commands inside envs, which I think should be preferred:

    #!/bin/bash -l
    conda run -n py36 python /mnt/data/sda/user_storage/stocks_etl.py
    

    The latter form should also work without the Conda initialization, by providing the full path to the conda entry point:

    #!/bin/bash
    
    # change to match your conda installation's location
    /home/user/anaconda3/condabin/conda run -n py36 python /mnt/data/sda/user_storage/stocks_etl.py
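
    The last variant can be verified independently of any shell initialization by running it with an emptied environment; this sketch reuses the paths from above, so adjust them to your system (importing numpy is just a convenient way to prove the env is active):

```
env -i /home/user/anaconda3/condabin/conda run -n py36 python -c 'import numpy; print(numpy.__version__)'
```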