I'm using the pandas.Series.str.extract function to extract the year and quarter from strings, but I can't get it to work correctly.
These are the strings in the series:
Divvy_Stations_2013
Divvy_Stations_2014-Q1Q2
Divvy_Stations_2014-Q3Q4
Divvy_Stations_2015
Divvy_Stations_2016_Q1Q2
Divvy_Stations_2016_Q3
Divvy_Stations_2016_Q4
Divvy_Stations_2017_Q1Q2
Divvy_Stations_2017_Q3Q4
My best regex attempt produced the matches shown below. Previously I tried using quantifiers with groups, but I only got NaN in both columns.
tables['origin'].drop_duplicates().str.extract(pat=r'.*(\d{4}).*(Q[1-4]).*')
It is almost right, but for the first and fourth rows I get only NaN. I know these strings don't contain Q\d, so NaN is fine in the quarter column, but not in the year column.
NaN NaN
2014 Q2
2014 Q4
NaN NaN
2016 Q2
2016 Q3
2016 Q4
2017 Q2
2017 Q4
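The all-NaN rows can be reproduced with the standard re module. This is a minimal sketch, assuming that str.extract applies the pattern to each string much like re.search does; the helper name extract is hypothetical. Because the second group is mandatory in this pattern, a string without a quarter fails to match as a whole, so the year group is lost as well.

```python
import re

# Original pattern: both groups are required for a match.
pattern = re.compile(r'.*(\d{4}).*(Q[1-4]).*')

def extract(name):
    """Return (year, quarter), or (None, None) when the pattern does not match.

    Hypothetical helper that mimics one row of the extracted result.
    """
    m = pattern.search(name)
    return m.groups() if m else (None, None)

no_quarter = extract("Divvy_Stations_2013")
with_quarter = extract("Divvy_Stations_2014-Q1Q2")

print(no_quarter)    # (None, None): no Q[1-4] anywhere, so the whole match fails
print(with_quarter)  # ('2014', 'Q2'): the greedy .* backtracks to the last quarter
```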
The solution is to replace the second * with ? and add a ? after the second group.
Originally I had:
'.*(\d{4}).*(Q[1-4]).*'
and the solution is:
'.*(\d{4}).?(Q[1-4])?.*'
Previously I used this: '.*(\d{4}).*(Q[1-4])?.*'
but it didn't work. Why does replacing the * with ? work? The description of * says "zero or more times" and the description of ? says "once or none".
Is "none" different from "zero" in regex language?
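To compare the three patterns side by side, here is a small sketch using the standard re module, again assuming str.extract behaves like re.search on each string; the labels and the helper groups_or_none are made up for illustration. It suggests that "none" and "zero" mean the same thing, and that the difference comes from greediness: .* can swallow the whole remainder of the string, which leaves the optional group to match nothing, whereas .? can consume at most one character, so the quarter is still there for the group to capture.

```python
import re

names = [
    "Divvy_Stations_2013",
    "Divvy_Stations_2014-Q1Q2",
    "Divvy_Stations_2016_Q3",
]

# Labels are illustrative only.
patterns = {
    "original": r'.*(\d{4}).*(Q[1-4]).*',
    "optional group only": r'.*(\d{4}).*(Q[1-4])?.*',
    "solution": r'.*(\d{4}).?(Q[1-4])?.*',
}

def groups_or_none(pattern, name):
    """Return the captured groups, or (None, None) if there is no match."""
    m = re.search(pattern, name)
    return m.groups() if m else (None, None)

results = {
    label: [groups_or_none(p, n) for n in names]
    for label, p in patterns.items()
}

for label, rows in results.items():
    print(label, rows)
# original            [(None, None), ('2014', 'Q2'), ('2016', 'Q3')]
# optional group only [('2013', None), ('2014', None), ('2016', None)]
# solution            [('2013', None), ('2014', 'Q1'), ('2016', 'Q3')]
```

Note that the solution pattern captures the first quarter after the year (Q1 for 2014-Q1Q2), while the original pattern captured the last one (Q2).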