How to extract timestamp and remove tailing portion from weblog using regex in pyspark?

I am doing some practice on weblog parsing and here is a question on regex:

The log file is in the format of:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

I need to get the timestamp, here is what I have now:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}))', 1).alias('timestamp'),

This returns me:

01/Aug/1995:00:00:01 -0400

My question is what does -0400 means? time zone? How do I remove it?

Solution

Yes - that's a timezone.

You can simply remove it by eliminating -\d{4} part of the pattern. So this is what you're looking for:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}))', 1).alias('timestamp'),

Online Demo

Also as a explanation:

- matches a dash plus a space after it literally
\d matches a digit
{4} limits it to only 4 digits

Replacing with conditional value in dplyr case_when()
Calculating moving average
Estimating non-monotonic bi-exponential curve fit
column type issue when converting csv to parquet using duckdb in R
"Target position can only be set for new windows" in chromote in R
Determine level of nesting in R?
Week start on Mondays
Center output from dm_draw
plot a network based on given values
Adding a X axis title to faceted ggballoonplot
Calculate mean of matrices having different dimensions
check if two columns have a one-to-one relationship in R
How to extract Std.Dev from VarCorr glmmTMB
How do you print to stderr in R?
How to plot China map with South China Sea in base R
Get column and row position of nth element in a matrix
Is there any authoritative documentation on R release nicknames?
R Glassdoor Web Scraping
Issue with graticule across 180° for several country/territory EEZs
Separating grouped layers in a raster stack in terra
How can I use group_by and mutate to perform a subtraction calculation with specific groupings? Time 0 minus Time X for all groups
How to directly open .R data containing data frame code in R?
Way to web-scrape a popular eSport website using R?
Variance calculation warning: longer object length is not a multiple
gratia::draw(): "'length.out' must be a non-negative number"
Using Swift as custom engine in knitr and including all previous content
convert source target value dataframe into a correlation matrix
ggplot2 plotting a 100% stacked area chart
Use string as formula for ipwtm function?
interpolarization within groups with NA