python regex regex-group

Python regex failing to find all matches


My regex is failing to capture all groups when I run it in Python, and I'm at a loss as to why...

I'm trying to pull out the sequences of numbers.

import re

my_string = '467..114..'
re.search(r'(?:(\d+)(?:\.*))*',my_string).groups()
# outputs ('114',)

I expect this to get the two groups, '467', and '114'.

To explain the expression (or what I'm thinking for it):

  1. I want to capture a series of digits (at least one, possibly many): (\d+)
  2. This will be followed by zero or more periods/dots, which I don't want to capture - (?:\.*)
  3. This pattern of digits followed by dots could repeat, so the two above are wrapped together in another non-capturing group that can repeat 0-or-more times.

I've gotten things to pull out either the first or second set of digits, but I'm struggling to get both at once...

I do know I could use re.findall() to just get the numbers, but I want the span and I want to know why it isn't working...

Edit:

I've found that re.finditer() does indeed return Match objects, not just a list of strings like re.findall(). That seems like a reasonable way for me to go, but I'm still curious about how to get it to work with search, which I feel should be possible...


Solution

  • This won't work with search, for two reasons. With the repeat, you have a repeated capture group, and a repeated capture group only retains its last match. Without the repeat, it still won't work, since search only returns the first match (from the manual):

    Scan through string looking for the first location where the regular expression pattern produces a match
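
Both behaviours can be checked directly. This is a minimal sketch using the string from the question; the variable names repeated and single are only for illustration.

```python
import re

my_string = '467..114..'

# Repeated group: the overall match spans the whole string, but group 1
# keeps only the value captured by the final iteration of the outer group.
repeated = re.search(r'(?:(\d+)\.*)*', my_string)
print(repeated.group(0))   # 467..114..
print(repeated.groups())   # ('114',)

# Without the repeat, search stops at the first location that matches.
single = re.search(r'(\d+)\.*', my_string)
print(single.groups())     # ('467',)
print(single.span(1))      # (0, 3)
```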

    For your use case, you can use finditer, but you still need to remove the repeat on the outer group, or the first match will consume the entire string, i.e. use re.finditer(r'(?:(\d+)(?:\.*))',my_string)

    Note that with finditer, the (?:\.*) is redundant, since it can match an empty string. If you want to enforce that the digits are followed by at least one ., change it to (?:\.+). Otherwise, remove it and simply use \d+. Either way, you can also remove most of the groups.
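
As a rough check of that point, the sketch below compares the optional-dots pattern with plain \d+, and then shows where requiring a dot makes a difference. The second input, no_trailing_dots, is a made-up string for illustration and is not from the question.

```python
import re

my_string = '467..114..'

# Optional dots: the match text includes the dots, but group 1 is just the digits.
grouped = [m.group(1) for m in re.finditer(r'(\d+)\.*', my_string)]
# Plain \d+ finds the same digit runs without needing any group.
plain = [m.group() for m in re.finditer(r'\d+', my_string)]
print(grouped == plain)  # True

# Requiring at least one dot changes the result when a number has no dot after it.
# 'no_trailing_dots' is a hypothetical input.
no_trailing_dots = '467..114'
required = [m.group() for m in re.finditer(r'\d+\.+', no_trailing_dots)]
print(required)  # ['467..']
```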

    So, to match the numbers with the dots, you can do:

    list(re.finditer(r'\d+\.+', my_string))
    

    And you will get

    [
     <re.Match object; span=(0, 5), match='467..'>,
     <re.Match object; span=(5, 10), match='114..'>
    ]
    

    Or to match only the numbers:

    list(re.finditer(r'\d+', my_string))
    

    Output:

    [
     <re.Match object; span=(0, 3), match='467'>,
     <re.Match object; span=(5, 8), match='114'>
    ]
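
Since the question asks for the spans, here is a small sketch of pulling the text and positions out of each Match object with group(), start(), end() and span(). The name found is only for illustration.

```python
import re

my_string = '467..114..'

# Collect (text, start, end) for every run of digits.
found = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', my_string)]
print(found)  # [('467', 0, 3), ('114', 5, 8)]

# m.span() returns the same (start, end) pair as a tuple, and slicing
# the original string with it reproduces the matched text.
for m in re.finditer(r'\d+', my_string):
    start, end = m.span()
    assert my_string[start:end] == m.group()
```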
    

    If you might have other numbers in the string and you only want to match the numbers followed by dots, either match them as above or use a lookahead to assert they are there:

    list(re.finditer(r'\d+(?=\.)', my_string))
    

    Output:

    [
     <re.Match object; span=(0, 3), match='467'>,
     <re.Match object; span=(5, 8), match='114'>
    ]
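
To see the lookahead make a difference, the sketch below uses a hypothetical variant of the input, other_string, with an extra number that has no dot after it. Plain \d+ picks it up; the lookahead version skips it and still reports spans that cover only the digits.

```python
import re

# Hypothetical input: '58' is a number with no dot after it.
other_string = '467..114..58'

all_numbers = [m.group() for m in re.finditer(r'\d+', other_string)]
before_dots = [(m.group(), m.span()) for m in re.finditer(r'\d+(?=\.)', other_string)]

print(all_numbers)  # ['467', '114', '58']
print(before_dots)  # [('467', (0, 3)), ('114', (5, 8))]
```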