python, regex, python-3.x, regex-greedy

regular expression python index:count


I have a list of values in a string, each in the form "index:count". I want to extract the index and the count from each pair, as in the code below:

          string="358:6 1260:2 1533:7 1548:292 1550:48 1561:3 1564:186"
          values=[v for v in re.findall('.+?:.+?.', string)]
          for g in values:
              index=g[:g.index(":")]
              count=g[g.index(":")+1:]
              print(int(index)+" "+str(count))

But I got this error message:

ValueError: invalid literal for int() with base 10: '2 1550'

It seems I wrote the regular expression incorrectly. Any idea how to fix this?


Solution

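  • You don't need the lazy modifier ? in your regex pattern. In '.+?:.+?.', the lazy .+? after the colon matches only one character and the trailing . consumes one more, so a count such as 292 is cut down to 29 and the leftover 2 is glued onto the next match. That is where the '2 1550' in your error message comes from.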

    EDIT NOTE: the pattern .+:.+ that I introduced in earlier edits is a poor choice for capturing this format. Please use \d+:\d+ instead. I have left .+:.+ in the answer because it can still solve the OP's problem with the workaround described further down.

    As long as your data is not malformed, contains no noise, and is neatly separated by whitespace, '.+:.+' is enough to match a whole line of index:count pairs. The better choice is \d+:\d+, since you know each pair is one or more digits, then a :, then one or more digits.

    regexr and regex101 are good tools for designing and visualizing your regex pattern.

    If you use the .+:.+ pattern, the greedy .+ matches the string as a whole, so re.findall returns a list with only one element. You then need to split that element yourself.

    In [  ]: import re
        ...: string="358:6 1260:2 1533:7 1548:292 1550:48 1561:3 1564:186"
        ...: values=re.findall(r'.+:.+', string)
        ...: print(values)
    ['358:6 1260:2 1533:7 1548:292 1550:48 1561:3 1564:186']
    

    Since the list has only one element, you can use pop() to take that string out and split it on whitespace with str.split().

    In [  ]: print(values.pop().split())
    ['358:6', '1260:2', '1533:7', '1548:292', '1550:48', '1561:3', '1564:186']
    

    If you use the \d+:\d+ pattern instead, re.findall returns a neatly separated list directly, because the pattern matches each index:count pair on its own. You can print the result as is.

    In [  ]: string="358:6 1260:2 1533:7 1548:292 1550:48 1561:3 1564:186"
        ...: values=re.findall(r'\d+:\d+', string)
        ...: print(values)
    ['358:6', '1260:2', '1533:7', '1548:292', '1550:48', '1561:3', '1564:186']
    

    Finally, you can print the result neatly with built-in string formatting. Disclaimer: I do not own that website; I just found it useful as a beginner :)

    In [  ]: for s in values:
        ...:     index, count = s.split(":")
        ...:     print("Index: {:>8} Count: {:>8}".format(index, count))
        ...:     
    Index:      358 Count:        6
    Index:     1260 Count:        2
    Index:     1533 Count:        7
    Index:     1548 Count:      292
    Index:     1550 Count:       48
    Index:     1561 Count:        3
    Index:     1564 Count:      186
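
As a possible follow-up to the steps above, here is a minimal sketch that does the whole thing in one script. It assumes the same sample string as in the question; the names broken and pairs are only illustrative and do not come from the original post. Putting capture groups in the pattern makes re.findall return (index, count) tuples, so the extra split(":") is not needed, and the first findall call shows what the question's original pattern actually matches.

```python
import re

string = "358:6 1260:2 1533:7 1548:292 1550:48 1561:3 1564:186"

# The question's pattern: the lazy `.+?` after the colon takes one character
# and the trailing `.` takes one more, so "1548:292" is cut to "1548:29" and
# the leftover "2" leaks into the next match as "2 1550:48".
broken = re.findall(r'.+?:.+?.', string)
print(broken)

# With two capture groups, findall returns a list of (index, count) tuples.
# Both parts are pure digit strings, so int() is safe here.
pairs = [(int(index), int(count))
         for index, count in re.findall(r'(\d+):(\d+)', string)]

for index, count in pairs:
    print("Index: {:>8} Count: {:>8}".format(index, count))
```

Converting to int right away is a design choice: it keeps the parsing in one place and lets you do arithmetic on the values later, at the cost of raising an error if the input ever stops being purely numeric.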