I am doing some practice on weblog parsing and here is a question on regex:
The log file is in the format of:
in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839
I need to get the timestamp. Here is what I have now:
regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}))', 1).alias('timestamp'),
This returns:
01/Aug/1995:00:00:01 -0400
My question is: what does -0400 mean? Is it a time zone? How do I remove it?
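Yes, that's the time zone, written as an offset from UTC (-0400 means four hours behind UTC).
You can remove it by dropping the " -\d{4}" part of the pattern, that is, the space, the dash and the four digits. So this is what you're looking for: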
regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}))', 1).alias('timestamp'),
Also, as an explanation of the part that was removed:
" -" matches a space followed by a literal dash
\d matches a single digit
{4} requires exactly 4 of the preceding digit
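If you want to check the pattern outside Spark, here is a minimal sketch using Python's built-in re module as a stand-in for regexp_extract. The names LOG_LINE, TIMESTAMP_RE and extract_timestamp are made up for this illustration, and the strptime step at the end is an optional extra that assumes an English locale for the month abbreviation.

```python
import re
from datetime import datetime

# Sample line copied from the question.
LOG_LINE = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
            '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

# Same pattern as in the answer, without the trailing " -\d{4}".
# One capture group is enough; the doubled parentheses are not needed.
TIMESTAMP_RE = re.compile(r'(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})')


def extract_timestamp(line):
    """Return the timestamp without the UTC offset, or '' if nothing matches.

    Returning '' on no match mirrors what regexp_extract does in Spark.
    """
    match = TIMESTAMP_RE.search(line)
    return match.group(1) if match else ''


timestamp = extract_timestamp(LOG_LINE)
print(timestamp)  # 01/Aug/1995:00:00:01

# Optional: turn the string into a datetime (%b assumes English month names).
parsed = datetime.strptime(timestamp, '%d/%b/%Y:%H:%M:%S')
print(parsed.isoformat())  # 1995-08-01T00:00:01
```

Keep in mind that dropping the offset also drops the information about which time zone the timestamp was recorded in, which matters if you later compare logs from different servers.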