Search code examples
sqlamazon-cloudfrontamazon-athena

Athena query on CloudFront Logs for max rate during 5 minute interval


I'm trying to query CloudFront Logs to derive a traffic baseline for the max rate of requests by IP address. I'm struggling to combine the date and time columns into a timestamp I can use to generate the 5 minute intervals. Am I even close with this query?

SELECT from_iso8601_timestamp(CONCAT(CAST(date as varchar), 'T', time, 'Z')) AS datetime, request_ip, COUNT(request_ip) AS count
FROM cloudfront_logs
WHERE "date"
  BETWEEN DATE '2020-10-28'
      AND DATE '2020-11-04'
GROUP BY CONCAT(CAST(date as varchar), 'T', time, 'Z'), request_ip
ORDER BY datetime ASC
LIMIT 10;

I'm trying to get to a point where I can do something like this:

SELECT
  FROM_UNIXTIME((UNIX_TIMESTAMP(table.timestamp) DIV 300) * 300) AS Timestamp,
  request_ip,
  COUNT(request_ip) AS count
FROM cloudfront_logs
WHERE "date"
  BETWEEN DATE '2020-10-28'
      AND DATE '2020-11-04'
GROUP BY CONCAT(CAST(date as varchar), 'T', time, 'Z'), request_ip, count
ORDER BY datetime ASC
LIMIT 10;

Per request, here's the format of the cloudfront_logs table based on the log format. I have tried to create the concatenated timestamp during table creation, but I had a lot of syntax errors:

CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
  `date` DATE,
  time STRING,
  location STRING,
  bytes BIGINT,
  request_ip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT,
  referrer STRING,
  user_agent STRING,
  query_string STRING,
  cookie STRING,
  result_type STRING,
  request_id STRING,
  host_header STRING,
  request_protocol STRING,
  request_bytes BIGINT,
  time_taken FLOAT,
  xforwarded_for STRING,
  ssl_protocol STRING,
  ssl_cipher STRING,
  response_result_type STRING,
  http_version STRING,
  fle_status STRING,
  fle_encrypted_fields INT,
  c_port INT,
  time_to_first_byte FLOAT,
  x_edge_detailed_result_type STRING,
  sc_content_type STRING,
  sc_content_len BIGINT,
  sc_range_start BIGINT,
  sc_range_end BIGINT
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
LOCATION 's3://CloudFront_bucket_name/CloudFront/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )

Taken from the official AWS Documentation on Querying Amazon CloudFront Logs

Failed example:

CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
  timestamp AS from_iso8601_timestamp(CONCAT(CAST(date as varchar), 'T', time, 'Z')) DATE,
  `date` DATE,
  time STRING,
  location STRING,
  bytes BIGINT,

Syntax error:

line 1:8: no viable alternative at input 'create external' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: e5147782-2e62-465e...)

UPDATE:

Here is a sample from the actual logs:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end
2020-10-29      04:59:03        MEL50-C1        1524    192.168.0.2     GET     example.cloudfront.net    /mobile/js/category/category.js 200     https://www.example.com/    Mozilla/5.0%20(Linux;%20Android%2010;%20A7)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/192.168.0.1%20Mobile%20Safari/537.36      v=2020102601    cookie=redacted Hit     BSVYL08_EXAMPLE==        cdn.example.com      https   2001    0.005   -       TLSv1.3 TLS_AES_128_GCM_SHA256  Hit     HTTP/2.0
        -       -       26906   0.004   Hit     application/javascript  787     -       -

SOLUTION:

SELECT from_unixtime(floor(to_unixtime(from_iso8601_timestamp(CONCAT(CAST(date AS VARCHAR), 'T', time, 'Z'))) / 300) * 300), request_ip, COUNT(request_ip) AS count
FROM cloudfront_logs
WHERE "date"
  BETWEEN DATE '2020-10-28'
      AND DATE '2020-11-04'
GROUP BY from_unixtime(floor(to_unixtime(from_iso8601_timestamp(CONCAT(CAST(date AS VARCHAR), 'T', time, 'Z'))) / 300) * 300), request_ip
ORDER BY count DESC;

Solution

  • Maybe there's a shorter way to express it, but you should be able to use the following expression to round your date + time to 5 minute intervals:

    from_unixtime(floor(to_unixtime(from_iso8601_timestamp(CONCAT(CAST(date AS VARCHAR), 'T', time, 'Z'))) / 300) * 300)
    

    If you use that in your first query you should get what you're after.