Search code examples
amazon-web-serviceshiveamazon-athenaprestoamazon-elb

AWS Athena alb logs: getting max hits per minute to the request url each day


I am trying to get the max hits (throughput) per minute to the request URLs for each day from the alb logs. I used the table projections to partition the table. Trying to figure out the query to get the desired result for max hits per minute for all the urls for last 1-3 years. The result should look like this (just a sample, the timestamp can be in any format)

Timestamp Url Max Hits Per Min
12-29-2019 8:01 AM url1 10720
12-29-2019 10:35 AM url2 21329
12-29-2019 10:35 AM url3 37420
12-30-2019 11:53 AM url1 5898
12-30-2019 01:30 PM url2 14230
12-30-2019 05:19 PM url3 20000

The table creation query:

CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
        type string,
        time string,
        elb string,
        client_ip string,
        client_port int,
        target_ip string,
        target_port int,
        request_processing_time double,
        target_processing_time double,
        response_processing_time double,
        elb_status_code string,
        target_status_code string,
        received_bytes bigint,
        sent_bytes bigint,
        request_verb string,
        request_url string,
        request_proto string,
        user_agent string,
        ssl_cipher string,
        ssl_protocol string,
        target_group_arn string,
        trace_id string,
        domain_name string,
        chosen_cert_arn string,
        matched_rule_priority string,
        request_creation_time string,
        actions_executed string,
        redirect_url string,
        lambda_error_reason string,
        target_port_list string,
        target_status_code_list string,
        classification string,
        classification_reason string
        )
        PARTITIONED BY ( `partition_date` string)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
        WITH SERDEPROPERTIES (
        'serialization.format' = '1',
        'input.regex' = 
    '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
        LOCATION 's3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/';
        TBLPROPERTIES ('projection.enabled'='true', 
        'projection.partition_date.format'='yyyy/MM/dd', 
        'projection.partition_date.interval'='1', 
        'projection.partition_date.interval.unit'='DAYS', 
        'projection.partition_date.range'='2018/01/01,NOW', 
        'projection.partition_date.type'='date', 
        'storage.location.template'='s3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/${partition_date}')

Solution

  • You can try:

    with cte as (
       select date_trunc('minute',timestamp) as minute, url, count(*) as hits_per_minute from mytable
    group by 1,2
    )
    select max_by(minute, hits_per_minute) as timestamp, url, max(hits_per_minute) from cte
    group by date_trunc('day', minute), url
    

    Explanation: The common table expression (cte) will calculate the number of hits per minute per url, from which you then extract the minute where the max hits is reached (using the max_by function) and the max hits, grouped by day and url.

    See the docs for: