I'm trying to query CloudFront Logs to derive a traffic baseline for the max rate of requests by IP address. I'm struggling to combine the date and time columns into a timestamp I can use to generate the 5 minute intervals. Am I even close with this query?
SELECT from_iso8601_timestamp(CONCAT(CAST(date as varchar), 'T', time, 'Z')) AS datetime, request_ip, COUNT(request_ip) AS count
FROM cloudfront_logs
WHERE "date"
BETWEEN DATE '2020-10-28'
AND DATE '2020-11-04'
GROUP BY CONCAT(CAST(date as varchar), 'T', time, 'Z'), request_ip
ORDER BY datetime ASC
LIMIT 10;
I'm trying to get to a point where I can do something like this:
SELECT
FROM_UNIXTIME((UNIX_TIMESTAMP(table.timestamp) DIV 300) * 300) AS Timestamp,
request_ip,
COUNT(request_ip) AS count
FROM cloudfront_logs
WHERE "date"
BETWEEN DATE '2020-10-28'
AND DATE '2020-11-04'
GROUP BY CONCAT(CAST(date as varchar), 'T', time, 'Z'), request_ip, count
ORDER BY datetime ASC
LIMIT 10;
Per request, here's the format of the cloudfront_logs table based on the log format. I have tried to create the concatenated timestamp during table creation, but I had a lot of syntax errors:
CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
`date` DATE,
time STRING,
location STRING,
bytes BIGINT,
request_ip STRING,
method STRING,
host STRING,
uri STRING,
status INT,
referrer STRING,
user_agent STRING,
query_string STRING,
cookie STRING,
result_type STRING,
request_id STRING,
host_header STRING,
request_protocol STRING,
request_bytes BIGINT,
time_taken FLOAT,
xforwarded_for STRING,
ssl_protocol STRING,
ssl_cipher STRING,
response_result_type STRING,
http_version STRING,
fle_status STRING,
fle_encrypted_fields INT,
c_port INT,
time_to_first_byte FLOAT,
x_edge_detailed_result_type STRING,
sc_content_type STRING,
sc_content_len BIGINT,
sc_range_start BIGINT,
sc_range_end BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://CloudFront_bucket_name/CloudFront/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
Taken from the official AWS Documentation on Querying Amazon CloudFront Logs
Failed example:
CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
timestamp AS from_iso8601_timestamp(CONCAT(CAST(date as varchar), 'T', time, 'Z')) DATE,
`date` DATE,
time STRING,
location STRING,
bytes BIGINT,
Syntax error:
line 1:8: no viable alternative at input 'create external' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: e5147782-2e62-465e...)
UPDATE:
Here is a sample from the actual logs:
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end
2020-10-29 04:59:03 MEL50-C1 1524 192.168.0.2 GET example.cloudfront.net /mobile/js/category/category.js 200 https://www.example.com/ Mozilla/5.0%20(Linux;%20Android%2010;%20A7)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/192.168.0.1%20Mobile%20Safari/537.36 v=2020102601 cookie=redacted Hit BSVYL08_EXAMPLE== cdn.example.com https 2001 0.005 - TLSv1.3 TLS_AES_128_GCM_SHA256 Hit HTTP/2.0
- - 26906 0.004 Hit application/javascript 787 - -
SOLUTION:
SELECT from_unixtime(floor(to_unixtime(from_iso8601_timestamp(CONCAT(CAST(date AS VARCHAR), 'T', time, 'Z'))) / 300) * 300), request_ip, COUNT(request_ip) AS count
FROM cloudfront_logs
WHERE "date"
BETWEEN DATE '2020-10-28'
AND DATE '2020-11-04'
GROUP BY from_unixtime(floor(to_unixtime(from_iso8601_timestamp(CONCAT(CAST(date AS VARCHAR), 'T', time, 'Z'))) / 300) * 300), request_ip
ORDER BY count DESC;
Maybe there's a shorter way to express it, but you should be able to use the following expression to round your date + time to 5 minute intervals:
from_unixtime(floor(to_unixtime(from_iso8601_timestamp(CONCAT(CAST(date AS VARCHAR), 'T', time, 'Z'))) / 300) * 300)
If you use that in your first query you should get what you're after.