python regex python-re

RegEx for capturing scientific citations


I am trying to capture parenthesized text that contains at least one digit (think citations). This is my current regex, and it works fine on regex101: https://regex101.com/r/oOHPvO/5

\((?=.*\d).+?\)

I want it to capture (Author 2000) and (2000), but not (Author).

I am trying to use Python to capture all of these, but in Python the regex also matches parenthesized text that contains no digits.

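import re

# Read the whole file into one string
with open('text.txt') as f:
    text = f.read()

# Raw string so the backslashes reach the regex engine unchanged
pattern = r"\((?=.*\d).*?\)"

citations = re.findall(pattern, text)

# Drop duplicates (note that set() does not preserve order)
citations = list(set(citations))

for c in citations:
    print(c)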

Any ideas what I am doing wrong?


Solution

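  • The lookahead (?=.*\d) is not limited to the parentheses: it succeeds whenever a digit appears anywhere later on the same line, so (Author) matches as soon as any digit follows it. To require the digit inside the parentheses, you may use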

    re.findall(r'\([^()\d]*\d[^()]*\)', s)
    

    Details

    • \( - a ( char
    • [^()\d]* - 0 or more chars other than (, ) and digits
    • \d - a digit
    • [^()]* - 0 or more chars other than ( and )
    • \) - a ) char
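The breakdown above can be written out as a commented pattern with re.VERBOSE. This is only a sketch: the sample text is made up for illustration, and the lookahead pattern from the question is included for comparison, to show how it picks up (Author) when a digit appears later on the same line.

```python
import re

# The pattern from the answer, spelled out piece by piece.
CITATION = re.compile(r"""
    \(          # a literal "("
    [^()\d]*    # zero or more chars that are not "(", ")" or a digit
    \d          # one digit
    [^()]*      # zero or more chars that are not "(" or ")"
    \)          # a literal ")"
""", re.VERBOSE)

# The lookahead pattern from the question, kept only for comparison.
LOOKAHEAD = re.compile(r"\((?=.*\d).*?\)")

# Hypothetical sample: "(Author)" has no digit, but digits follow it on the line.
text = "Some (Author) and (Author 2000) and (2000)"

print(CITATION.findall(text))   # ['(Author 2000)', '(2000)']
print(LOOKAHEAD.findall(text))  # ['(Author)', '(Author 2000)', '(2000)']
```

Because the character classes exclude ( and ), the match can never run past the closing parenthesis, so the digit has to sit inside the same pair.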

    Python demo:

    import re
    rx = re.compile(r"\([^()\d]*\d[^()]*\)")
    s = "Some (Author) and (Author 2000)"
    print(rx.findall(s)) # => ['(Author 2000)']
    

    To get the results without parentheses, add a capturing group:

    rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")
                        ^                ^
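As a small sketch of the capturing-group variant: when the pattern has exactly one group, re.findall returns the group contents instead of the whole match. The sample string is invented, and the order-preserving deduplication with dict.fromkeys is an assumption about what is wanted, offered as an alternative to the list(set(...)) step in the question.

```python
import re

rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")

# Hypothetical sample with a repeated citation.
s = "Some (Author) and (Author 2000), again (Author 2000) and (2000)"

# With a single capturing group, findall returns only the group text.
inner = rx.findall(s)
print(inner)   # ['Author 2000', 'Author 2000', '2000']

# Remove duplicates while keeping first-seen order (dicts preserve insertion order).
unique = list(dict.fromkeys(inner))
print(unique)  # ['Author 2000', '2000']
```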
    
