
How to extract the timestamp and remove the trailing portion from a weblog using regex in PySpark?


I am practicing weblog parsing and have a question about regex:

The log file is in the format of:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

I need to get the timestamp, here is what I have now:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}))', 1).alias('timestamp'),

This returns:

01/Aug/1995:00:00:01 -0400

My question is: what does -0400 mean? Is it the time zone? How do I remove it?


Solution

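  • Yes, -0400 is the time zone, written as an offset from UTC (here, 4 hours behind UTC).

    You can remove it by dropping the trailing " -\d{4}" part (a space, a dash, and four digits) from the pattern. So this is what you're looking for: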

    regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}))', 1).alias('timestamp'),
    


    As an explanation of the part that was removed:

    • " -" matches a space followed by a literal dash
    • \d matches a digit
    • {4} repeats the preceding \d exactly 4 times
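
    To check both patterns without starting a Spark session, here is a minimal sketch using Python's built-in re module on the sample line from the question. The helper name extract and the constants are made up for illustration; the helper returns an empty string on no match, which is how I understand regexp_extract to behave. The strptime call is only there to confirm that -0400 parses as a UTC offset, and it assumes the default (English) locale for the month abbreviation.

```python
import re
from datetime import datetime, timedelta

# Sample log line taken from the question.
line = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

# Original pattern (keeps the offset) and the shortened one (drops it).
WITH_OFFSET = r'(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})'
WITHOUT_OFFSET = r'(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})'


def extract(pattern, text):
    """Hypothetical helper: return group 1 of the first match, or '' if none."""
    match = re.search(pattern, text)
    return match.group(1) if match else ''


full = extract(WITH_OFFSET, line)
bare = extract(WITHOUT_OFFSET, line)

# %z parses "-0400" as a UTC offset, confirming what the suffix means.
parsed = datetime.strptime(full, '%d/%b/%Y:%H:%M:%S %z')

print(full)
print(bare)
print(parsed.utcoffset() == timedelta(hours=-4))
```

    Note that " -\d{4}" only matches negative offsets; if the logs could also contain positive ones such as +0100, a character class like [+-] in place of the dash would cover both.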