
Capturing groups in a regex with quantifiers


I'm using the pandas.Series.str.extract function to extract the year and quarter from strings, but I can't get it to work correctly.

These are the strings in the series:

Divvy_Stations_2013
Divvy_Stations_2014-Q1Q2
Divvy_Stations_2014-Q3Q4
Divvy_Stations_2015
Divvy_Stations_2016_Q1Q2
Divvy_Stations_2016_Q3
Divvy_Stations_2016_Q4
Divvy_Stations_2017_Q1Q2
Divvy_Stations_2017_Q3Q4

This is my best attempt at a regex, with the matches it produces shown further down. Previously I tried to use quantifiers with the groups, but I only got NaN in both columns.

tables['origin'].drop_duplicates().str.extract(pat=r'.*(\d{4}).*(Q[1-4]).*')

It is almost right, but for the first and fourth rows I only get NaN. I know that these strings don't contain Q\d, so NaN is fine in the quarter column, but not in the year column.


NaN     NaN
2014    Q2
2014    Q4
NaN     NaN
2016    Q2
2016    Q3
2016    Q4
2017    Q2
2017    Q4
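
The all-NaN rows can be reproduced outside pandas with Python's built-in re module, on the assumption that str.extract applies the pattern the same way re.search does. This is only a minimal sketch, and first_try is a hypothetical helper name, not part of pandas. Because the second group is mandatory, a string without Q1 to Q4 fails the whole pattern, so the year is lost as well.

```python
import re

# The pattern from the first attempt: the second group (Q[1-4]) is mandatory.
PATTERN = re.compile(r'.*(\d{4}).*(Q[1-4]).*')

def first_try(name):
    """Return (year, quarter), or (None, None) when the whole pattern fails."""
    m = PATTERN.search(name)
    return m.groups() if m else (None, None)

print(first_try('Divvy_Stations_2013'))       # (None, None): no quarter, so no match at all
print(first_try('Divvy_Stations_2014-Q1Q2'))  # ('2014', 'Q2'): the greedy .* runs on to the last quarter
```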

Solution

  • The solution is to replace the second * with ? and to add a ? after the second group.

    Originally I had:

    '.*(\d{4}).*(Q[1-4]).*'

    and the solution is:

    '.*(\d{4}).?(Q[1-4])?.*'

    Previously I used this: '.*(\d{4}).*(Q[1-4])?.*' but it didn't work. Why does replacing the * with ? work? The description of * says "zero or more times" and the description of ? says "once or none".

    Is "none" different from "zero" in regex language?
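
One way to look into this is to run the two optional-group patterns side by side with Python's re module. This is only a sketch: extract is a hypothetical helper, and it assumes str.extract behaves like re.search for these patterns. As far as I can tell, "none" and "zero" mean the same thing. What differs is that .* is greedy and can swallow the quarter before the optional group is tried, while .? can consume at most one character.

```python
import re

def extract(pattern, name):
    """Return the captured (year, quarter) groups, or None if nothing matches."""
    m = re.search(pattern, name)
    return m.groups() if m else None

name = 'Divvy_Stations_2014-Q1Q2'

# Optional group after .*: the greedy .* eats '-Q1Q2', and the optional
# group is then satisfied by matching nothing, so no backtracking happens.
print(extract(r'.*(\d{4}).*(Q[1-4])?.*', name))   # ('2014', None)

# .? takes at most one character ('-'), so the optional group is tried at 'Q1'.
print(extract(r'.*(\d{4}).?(Q[1-4])?.*', name))   # ('2014', 'Q1')

# A name without a quarter still yields the year.
print(extract(r'.*(\d{4}).?(Q[1-4])?.*', 'Divvy_Stations_2013'))  # ('2013', None)
```

Note that with .? the pattern captures the first quarter in the name (Q1) rather than the last one (Q2) that the original pattern returned, and it relies on there being at most one separator character between the year and the quarter.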