My goal is to query only yesterdays data with Impala using a unix timestamp field. I dont want to hard code the date because I want this script to run daily and only query the previous day. I am using python and have created strings for the start and end times.
The endtime is stored as a bigint and looks like the following: 1561996779000
.
yesterday = dt.date.fromordinal(dt.date.today().toordinal()-1).strftime("%F")
yesterday_start = yesterday + ' 00:00:00'
yesterday_end = yesterday + ' 23:59:59'
yesterday_start
'2019-07-28 00:00:00'
yesterday_end
'2019-07-28 23:59:59'
I have tried the following but none of these seem to work:
cursor.execute('select sourceaddress, sourcehostname, sourceusername, endtime from proxy where endtime between unix_timestamp("+yesterday_start+") and unix_timestamp("+yesterday_end+")')
cursor.execute("select sourceaddress, sourcehostname, sourceusername, endtime from proxy where endtime between unix_timestamp("+yesterday_start+") and unix_timestamp("+yesterday_end+")")
cursor.execute("select sourceaddress, sourcehostname, sourceusername, endtime from proxy where endtime between unix_timestamp('yesterday_start') and unix_timestamp('yesterday_end')")
cursor.execute("SELECT * from proxy where endtime between unix_timestamp('"+yesterday_start+"') and unix_timestamp('"+yesterday_end+"')")
Here is an example from the Impala documentation:
select unix_timestamp('2015-05-15 12:00:00');
+---------------------------------------+
| unix_timestamp('2015-05-15 12:00:00') |
+---------------------------------------+
| 1431691200 |
+---------------------------------------+
Still looking for a better way to accomplish this. This works though.
#Date pattern
date_pattern = '%Y-%m-%d %H:%M:%S'
#Yesterday system date
yesterday = dt.date.fromordinal(dt.date.today().toordinal()-1).strftime("%F")
#Start datetime
yesterday_start = yesterday + ' 00:00:00'
yesterday_start_epoch = int(time.mktime(time.strptime(yesterday_start, date_pattern)))
yesterday_start_epoch_str = str(yesterday_start_epoch)
#End datetime
yesterday_end = yesterday + ' 23:59:59'
yesterday_end_epoch = int(time.mktime(time.strptime(yesterday_end, date_pattern)))
yesterday_end_epoch_str = str(yesterday_end_epoch)
#Start timer
start_time = timeit.default_timer()
#Connection and query
IMPALA_HOST = os.getenv('HOST', 'server')
conn = connect(host=HOST, port=port, auth_mechanism='', use_ssl=True)
cursor = conn.cursor()
cursor.execute('SHOW TABLES')
tables = as_pandas(cursor)
cursor.execute("select sourceaddress, sourcehostname, sourceusername, endtime from proxy where endtime between cast('"+yesterday_start_epoch_str+"' AS INT) and cast('"+yesterday_end_epoch_str+"' AS INT)")
df = as_pandas(cursor)
#End timer
end_time = timeit.default_timer()
#Print time it took
print("Elapsed time: {}".format(end_time - start_time))