Search code examples
pythonimpala

How to use a python variable in an Impala query to find previous day using epoch time?


My goal is to query only yesterdays data with Impala using a unix timestamp field. I dont want to hard code the date because I want this script to run daily and only query the previous day. I am using python and have created strings for the start and end times.

The endtime is stored as a bigint and looks like the following: 1561996779000.

yesterday = dt.date.fromordinal(dt.date.today().toordinal()-1).strftime("%F")
yesterday_start = yesterday + ' 00:00:00'
yesterday_end = yesterday + ' 23:59:59'

yesterday_start
'2019-07-28 00:00:00'
yesterday_end
'2019-07-28 23:59:59'

I have tried the following but none of these seem to work:

cursor.execute('select sourceaddress, sourcehostname, sourceusername, endtime from proxy where endtime between unix_timestamp("+yesterday_start+") and unix_timestamp("+yesterday_end+")')
cursor.execute("select sourceaddress, sourcehostname, sourceusername, endtime from proxy where endtime between unix_timestamp("+yesterday_start+") and unix_timestamp("+yesterday_end+")")
cursor.execute("select sourceaddress, sourcehostname, sourceusername, endtime from proxy where endtime between unix_timestamp('yesterday_start') and unix_timestamp('yesterday_end')")
cursor.execute("SELECT * from proxy where endtime between unix_timestamp('"+yesterday_start+"') and unix_timestamp('"+yesterday_end+"')")

Here is an example from the Impala documentation:

select unix_timestamp('2015-05-15 12:00:00');
+---------------------------------------+
| unix_timestamp('2015-05-15 12:00:00') |
+---------------------------------------+
| 1431691200                            |
+---------------------------------------+

Solution

  • Still looking for a better way to accomplish this. This works though.

    #Date pattern
    date_pattern = '%Y-%m-%d %H:%M:%S'
    #Yesterday system date
    yesterday = dt.date.fromordinal(dt.date.today().toordinal()-1).strftime("%F")
    #Start datetime
    yesterday_start = yesterday + ' 00:00:00'
    yesterday_start_epoch = int(time.mktime(time.strptime(yesterday_start, date_pattern)))
    yesterday_start_epoch_str = str(yesterday_start_epoch)
    #End datetime 
    yesterday_end = yesterday + ' 23:59:59'
    yesterday_end_epoch = int(time.mktime(time.strptime(yesterday_end, date_pattern)))
    yesterday_end_epoch_str = str(yesterday_end_epoch)
    
    #Start timer
    start_time = timeit.default_timer()
    #Connection and query
    IMPALA_HOST = os.getenv('HOST', 'server')
    conn = connect(host=HOST, port=port, auth_mechanism='', use_ssl=True)
    cursor = conn.cursor()
    cursor.execute('SHOW TABLES')
    tables = as_pandas(cursor)
    cursor.execute("select sourceaddress, sourcehostname, sourceusername, endtime from proxy where endtime between cast('"+yesterday_start_epoch_str+"' AS INT) and cast('"+yesterday_end_epoch_str+"' AS INT)")
    df = as_pandas(cursor)
    #End timer
    end_time = timeit.default_timer()
    #Print time it took
    print("Elapsed time: {}".format(end_time - start_time))