I'm trying to extract cited BibTeX keys from a LaTeX document using a regex in Python.
I'd like to exclude a citation if it is commented out (an unescaped % in front of it) but still include it if an escaped percent sign (\%) precedes it.
Here is what I came up with so far:
\\(?:no|)cite\w*\{(.*?)\}
An example to try it out:
blablabla
Author et. al \cite{author92} bla bla. % should match
\citep{author93} % should match
\nocite{author94} % should match
100\%\nocite{author95} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\cite{author97, author98, author99} % should match
\nocite{*} % should not match
Regex101 testing: https://regex101.com/r/ZaI8kG/2/
I appreciate any help.
Use the newer regex module (pip install regex) with the following expression:
(?<!\\)%.+(*SKIP)(*FAIL)|\\(?:no)?citep?\{(?P<author>(?!\*)[^{}]+)\}
(?<!\\)%.+(*SKIP)(*FAIL) # % (not preceded by \)
# and the whole line shall fail
| # or
\\(?:no)?citep? # \nocite, \cite or \citep
\{ # { literally
(?P<author>(?!\*)[^{}]+) # must not start with a star
\} # } literally
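The negative lookbehind is the part that tells a real comment from an escaped percent sign. As a minimal sketch, using only the standard re module and a hypothetical helper named comment_index (not part of the original answer), it can be checked in isolation like this:

```python
import re

# A % that is NOT preceded by a backslash, i.e. the start of a LaTeX comment.
comment_start = re.compile(r'(?<!\\)%')

def comment_index(line):
    """Return the position where a comment starts, or None if there is none."""
    m = comment_start.search(line)
    return m.start() if m else None

lines = [
    r'100\% \nocite{author95}',               # escaped percent only
    r'%\nocite{author96}',                    # whole line commented out
    r'\citep{author93} % trailing comment',   # comment after the citation
]
positions = [comment_index(line) for line in lines]
print(positions)
```

Only the second and third lines report a comment start, so a citation after \% is still visible to the rest of the pattern. Note that this simple lookbehind would misjudge an escaped backslash directly followed by a comment (\\%).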
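If installing another library is not an option, note that the built-in re module does not support (*SKIP)(*FAIL), so you need to change the expression to
(?<!\\)%.+
|
(\\(?:no)?citep?
\{
((?!\*)[^{}]+)
\})
and check programmatically whether the second capture group has been set (that is, whether it is not empty).
In Python, the latter could look like this: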
import re
latex = r"""
blablabla
Author et. al \cite{author92} bla bla. % should match
\citep{author93} % should match
\nocite{author94} % should match
100\%\nocite{author95} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\cite{author97, author98, author99} % should match
\nocite{*} % should not match
"""
rx = re.compile(r'''(?<!\\)%.+|(\\(?:no)?citep?\{((?!\*)[^{}]+)\})''')
authors = [m.group(2) for m in rx.finditer(latex) if m.group(2)]
print(authors)
Which yields
['author92', 'author93', 'author94', 'author95', 'author95', 'author97, author98, author99']
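Note that a multi-key citation comes back as one string. If individual BibTeX keys are needed, the captured group can be split on commas afterwards. The following is a minimal sketch with the standard re module; the function name cited_keys and the shortened sample are assumptions made for illustration, and the pattern still only covers \cite, \citep and \nocite.

```python
import re

# First branch consumes comments; second branch captures the contents of
# \cite{...}, \citep{...} or \nocite{...} unless they start with a star.
rx = re.compile(r'(?<!\\)%.+|\\(?:no)?citep?\{((?!\*)[^{}]+)\}')

def cited_keys(latex):
    """Return every cited key as its own list entry (hypothetical helper)."""
    keys = []
    for m in rx.finditer(latex):
        if m.group(1):  # None when the comment branch matched
            # Split "a, b, c" into separate keys and drop empty pieces.
            keys.extend(k.strip() for k in m.group(1).split(',') if k.strip())
    return keys

sample = r"""
Author et. al \cite{author92} bla bla.
100\% \nocite{author95}
%\nocite{author96}
\cite{author97, author98, author99}
\nocite{*}
"""
print(cited_keys(sample))
```

Wrapping the result in dict.fromkeys(...) would remove duplicate keys while keeping their order, if a key is cited more than once.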