I have a bash script that I'm trying to run with a cron job on my Ubuntu server. I want it to run every day at 08:00 UTC. The bash script activates a conda Python virtual environment and runs a Python script that pulls data and loads it into a MySQL database. I also have logging throughout the Python script. Last night no new data showed up in the database and no new logs were created. Below are the crontab entry and the stocks_etl.sh script. Does anyone see what the issue might be, and how to fix it?
I edit the crontab with sudo crontab -e, and it shows:
0 8 * * * /mnt/data/sda/user_storage/stocks_etl.sh
stocks_etl.sh
#!/bin/bash
source activate py36
python /mnt/data/sda/user_storage/stocks_etl.py
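Since the job left no trace at all, one debugging aid is to have the crontab entry itself capture everything the script prints. A sketch of the redirection (the log path is only an illustration), demonstrated on a throwaway command:

```shell
# hypothetical crontab entry that captures stdout and stderr (log path is an assumption):
#   0 8 * * * /mnt/data/sda/user_storage/stocks_etl.sh >> /mnt/data/sda/user_storage/logs/cron.log 2>&1
# the same redirection, demonstrated on a throwaway command:
LOG=$(mktemp)
{ echo "stdout line"; echo "stderr line" >&2; } >> "$LOG" 2>&1
cat "$LOG"   # both lines end up in the log
```

With this in place, even a failure before the Python logging is configured leaves evidence in the log file.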
update #3:
When I run this command from the command line on my Ubuntu server, it works fine:
bash ~/etl_scripts/stocks_etl.bashrc
When I run it from crontab as the same user, it throws the error below.
error:
Started stocks_etl.bash
Thu Feb 25 05:20:01 UTC 2021
/home/user/etl_scripts/stocks_etl.bashrc: line 5: activate: No such file or directory
Traceback (most recent call last):
File "/home/user/etl_scripts/stocks_etl.py", line 4, in <module>
import numpy as np
ImportError: No module named numpy
Here's the bashrc file:
#!/bin/bash -l
echo 'Started stocks_etl.bash'
date +'%a %b %e %H:%M:%S %Z %Y'
source activate py36
python ~/etl_scripts/stocks_etl.py
It's as if, when run from crontab, it can't find conda and falls back to the base Python installation, which doesn't have numpy installed. Does anyone see what the issue might be, and can you suggest how to solve it?
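That hunch is easy to confirm: cron doesn't source `~/.bashrc`, so jobs run with a minimal PATH and anything installed under `~/anaconda3` is invisible unless referenced by absolute path. A quick simulation (the PATH value mirrors cron's typical default):

```shell
# cron jobs don't get a login shell, so they see a minimal PATH.
# simulate cron's typical environment and inspect what it would see:
env -i PATH=/usr/bin:/bin bash -c 'echo "cron-like PATH: $PATH"'
```

Anything not on that stripped-down PATH, conda included, has to be invoked by its full path.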
update #2: now that I've run chmod 777 on the files, the crontab execution produces the error below. It's as if the conda virtual env isn't being activated and the script is run with the base Python installation instead.
error:
/mnt/data/sda/user_storage/etl_scripts/stocks_etl.sh: line 2: activate: No such file or directory
Traceback (most recent call last):
File "/mnt/data/sda/user_storage/etl_scripts/stocks_etl.py", line 1, in <module>
import numpy as np
ImportError: No module named numpy
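One workaround that sidesteps activation entirely is to call the environment's interpreter by absolute path (the py36 path below is an assumption; `conda env list` shows the real one). The principle, demonstrated with whichever python3 is on the current machine:

```shell
# the real wrapper would call the env's interpreter directly, e.g. (hypothetical path):
#   /home/user/anaconda3/envs/py36/bin/python /mnt/data/sda/user_storage/etl_scripts/stocks_etl.py
# same principle with whatever python3 is installed here:
PYBIN=$(command -v python3)
"$PYBIN" -c 'import sys; print(sys.executable)'
```

Invoking the env's python directly means its site-packages (numpy included) are used with no `activate` step at all.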
update:
stocks_etl.py
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from yahoofinancials import YahooFinancials
import pymysql
import datetime
import logging
import time
import glob
from sqlalchemy import create_engine
import os
# helper functions

# function for creating error logs
def error_logger(path):
    # adding a timestamp to logname
    ts = str(datetime.datetime.now().isoformat())
    logging.basicConfig(filename=path + ts + '.log', level=logging.DEBUG,
                        format='%(asctime)s %(levelname)s %(name)s %(message)s')
    # return the configured logger so callers can actually use it
    return logging.getLogger(__name__)
# function to query mysql db and return dataframe of results
def mysql_query(user, password, database, host, query):
    connection = pymysql.connect(user=user, password=password, database=database, host=host)
    df = None
    try:
        df = pd.read_sql(query, connection)
        logging.info('query succeeded: ' + query)
    except Exception as err:
        logger.error('query failed: ' + query + ' got error: ' + str(err))
    finally:
        connection.close()
        logging.info('closed mysql connection')
    return df
# function to download OHLC stock data
def download_stocks(Ticker_list, start_date, end_date, time_interval, path):
    # get data for each stock in Ticker_list and save it as a csv
    failed_list = []
    passed_list = []
    for ticker in Ticker_list:
        try:
            yahoo_financials = YahooFinancials(ticker)
            data = yahoo_financials.get_historical_price_data(start_date, end_date, time_interval=time_interval)
            prices_df = pd.DataFrame(data[ticker]['prices'])
            prices_df = prices_df[['adjclose', 'close', 'formatted_date', 'high', 'low', 'open', 'volume']]
            prices_df['date'] = prices_df['formatted_date']
            prices_df = prices_df[['date', 'adjclose', 'close', 'high', 'low', 'open', 'volume']]
            prices_df['Ticker'] = ticker
            prices_df.to_csv(path + ticker + '.csv')
            passed_list.append(ticker)
            logging.info('downloaded: ' + ticker)
            time.sleep(1)
        except Exception as err:
            failed_list.append(ticker)
            logger.error('tried download: ' + ticker + ' got error: ' + str(err))
# function to read the csvs in and append them into one dataframe
def stock_dataframe(path):
    try:
        all_files = glob.glob(path + "/*.csv")
        li = []
        for filename in all_files:
            df = pd.read_csv(filename, index_col=None, header=0)
            li.append(df)
        frame = pd.concat(li, axis=0, ignore_index=True)
        frame = frame[['date', 'adjclose', 'close', 'high', 'low', 'open', 'volume', 'Ticker']]
        logging.info('created stock dataframe')
        return frame
    except Exception as err:
        logger.error('stock dataframe create failed got error: ' + str(err))
# write dataframe to mysql db
def write_dataframe(username, password, host, schema, dataframe, table, if_exists, index):
    try:
        engine = create_engine("mysql+pymysql://" + str(username) + ":" + str(password) + "@" + str(host) + "/" + str(schema))
        dataframe.to_sql(con=engine, name=table, if_exists=if_exists, index=index)
        logging.info('write_dataframe succeeded')
    except Exception as err:
        logger.error('write_dataframe failed got error: ' + str(err))
# to do
# - create directory with datetime prefix as part of path
# - add step that checks max date in current table
# - only pull data later than max date in current table
# - check max date in current derived table
# - only pull data later than current date from source table
def etl_pipeline(table_var):
    i = table_var
    max_date_query = "select max(date) as max_date from " + i
    try:
        max_date_df = mysql_query(user='user',
                                  password='psswd',
                                  database='stocks',
                                  host='xxxxx',
                                  query=max_date_query)
        logging.info('max_date succeeded: ' + i)
    except Exception as err:
        logger.error('max_date failed: ' + i)

    try:
        # get max date
        max_date = max_date_df.astype(str)['max_date'][0]
        # create directory for downloading new stocks into
        base_path = '/mnt/data/sda/user_storage/stock_data_downloads/'
        current_date = datetime.datetime.today().strftime('%Y-%m-%d')
        directory_path = base_path + i + '/' + current_date
        os.mkdir(directory_path)
        logging.info('create directory succeeded: ' + i)
    except Exception as err:
        logger.error('create directory failed: ' + i)

    # getting ticker symbols
    ticker_query = "select distinct ticker as ticker from " + i
    try:
        tickers_df = mysql_query(user='user',
                                 password='psswd',
                                 database='stocks',
                                 host='xxxxx',
                                 query=ticker_query)
        logging.info('get tickers succeeded: ' + i)
    except Exception as err:
        logger.error('get tickers failed: ' + i)

    stocks = tickers_df.ticker.tolist()

    # download stocks
    # Note: must add '/' to end of path
    download_stocks(Ticker_list=stocks,
                    start_date=max_date,
                    end_date=current_date,
                    time_interval='daily',
                    path=directory_path + '/')

    # create dataframe from the downloaded csvs
    stocks_df = stock_dataframe(path=directory_path)

    # append the new rows to the mysql table
    write_dataframe(username='user',
                    password='psswd',
                    host='xxxxx',
                    schema='stocks',
                    dataframe=stocks_df,
                    table=i,
                    if_exists='append',
                    index=False)

    # creating additional avg annual returns
    try:
        query = """select ticker, avg(annual_returns) as avg_annual_returns from (
            select ticker, date, ( -1 +
                a.adjclose / max(a.adjclose) over (partition by ticker
                    order by date
                    range between interval 365 day preceding and interval 365 day preceding
                )
            ) as annual_returns
            from """ + i + """ a
        ) b where annual_returns is not null
        group by ticker"""
        df = mysql_query(user='user', password='psswd', database='stocks', host='xxxxx', query=query)
        logging.info('etl succeeded: ' + i + '_returns')
    except Exception as err:
        logger.error('etl failed: ' + i + '_returns')

    # adding the avg annual returns to their own table
    write_dataframe(username='user',
                    password='psswd',
                    host='xxxxx',
                    schema='stocks',
                    dataframe=df,
                    table=i + '_returns',
                    if_exists='replace',
                    index=False)

# start logging, adding a timestamp to the logname
ts = str(datetime.datetime.now().isoformat())
logging.basicConfig(filename='/mnt/data/sda/user_storage/logs/etl_scripts/' + ts + '.log', level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(name)s %(message)s')
logger = logging.getLogger(__name__)

table_list = ['trav_stocks', 's_and_p', 'american_mutual_funds']
for j in table_list:
    try:
        etl_pipeline(j)
        logging.info('etl_pipeline succeeded: ' + j)
    except Exception as err:
        logger.error('etl_pipeline failed: ' + j)
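Separately from the cron problem, the script's earlier "doesn't recognize logger" note is a scoping issue: the helper functions use a module-level `logger` that is only defined near the bottom of the file. A minimal, self-contained sketch of the usual pattern (names and the in-memory stream are illustrative, standing in for the timestamped log file):

```python
import io
import logging

# configure logging once, near the top of the module, then reuse the logger everywhere
logger = logging.getLogger("stocks_etl_demo")
logger.setLevel(logging.DEBUG)

# an in-memory stream stands in for the timestamped log file so the demo is self-checking
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
logger.addHandler(handler)

def helper():
    # helper functions use the already-configured module-level logger;
    # no need to create (or fail to find) one inside each function
    logger.error("helper failed: demo")

helper()
print(buf.getvalue().strip())  # → ERROR stocks_etl_demo helper failed: demo
```

Defining the logger before any function can run removes the NameError risk entirely, regardless of call order.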
update:
I changed my file to a .bash file and the code inside to:
#!/bin/bash -l
echo ''
'Started stocks_etl.bash'
date +'%a %b %e %H:%M:%S %Z %Y'
source /home/user/anaconda3/envs/py36/bin/activate
conda activate py36
python ~/etl_scripts/stocks_etl.py
Now, when it's run from crontab, I'm getting the error below.
error:
/home/user/etl_scripts/stocks_etl.bash: line 3: Started stocks_etl.bash: command not found
Fri Feb 26 16:28:01 UTC 2021
/home/user/etl_scripts/stocks_etl.bash: line 7: /home/user/anaconda3/envs/py36/bin/activate: No such file or directory
/home/user/etl_scripts/stocks_etl.bash: line 8: conda: command not found
Traceback (most recent call last):
File "/home/user/etl_scripts/stocks_etl.py", line 4, in <module>
import numpy as np
ImportError: No module named numpy
update:
code:
#!/bin/bash
echo ''
'Started stocks_etl.bash'
date +'%a %b %e %H:%M:%S %Z %Y'
/home/user/anaconda3 run -n py36 python ~/user/etl_scripts/stocks_etl.py
error:
/home/user/etl_scripts/stocks_etl.bash: line 3: Started stocks_etl.bash: command not found
Fri Feb 26 16:43:01 UTC 2021
/home/user/etl_scripts/stocks_etl.bash: line 7: /home/user/anaconda3: Is a directory
First, the `source activate` syntax was deprecated years ago (how old is your Conda instance?) - you should be using `conda activate`. Second, Conda shell commands are loaded into the shell as part of sourcing `.bashrc` or `.bash_profile`. So at the very least, you need to include the `-l` in the shebang and use:
#!/bin/bash -l
conda activate py36
python /mnt/data/sda/user_storage/stocks_etl.py
You may need to do something extra to ensure the `.bashrc` it sources is correct (e.g., what user does the job run as?).
Note that Conda also has the `conda run` command for executing commands inside envs, which I think should be preferred:
#!/bin/bash -l
conda run -n py36 python /mnt/data/sda/user_storage/stocks_etl.py
The latter form should also work without the Conda initialization, by providing the full path to the `conda` entry point:
#!/bin/bash
# change this to match your `conda` location
/home/user/anaconda3/condabin/conda run -n py36 python /mnt/data/sda/user_storage/stocks_etl.py
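Putting the pieces together for the cron case - the absolute `conda` path sidesteps cron's minimal PATH, `conda run` avoids needing shell initialization, and output redirection means the next silent failure leaves a trace. A sketch of a cron-safe wrapper (all paths come from the post and may need adjusting; the existence check is only there so the sketch degrades gracefully):

```shell
#!/bin/bash
# wrapper meant to be called from cron; no login shell or conda init required
CONDA=/home/user/anaconda3/condabin/conda                      # adjust to your install
LOGFILE=/mnt/data/sda/user_storage/logs/etl_scripts/cron.log   # adjust to your log dir
if [ -x "$CONDA" ]; then
    "$CONDA" run -n py36 python /mnt/data/sda/user_storage/stocks_etl.py >> "$LOGFILE" 2>&1
else
    echo "conda not found at $CONDA" >&2
fi
```

The crontab entry then stays as simple as the original `0 8 * * * /path/to/wrapper.sh`.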