Extract cited BibTeX keys from a TeX file using regex in Python


I'm trying to extract the cited BibTeX keys from a LaTeX document using a regex in Python.

I'd like to exclude a citation if it is commented out (% in front), but still include it if the percent sign in front is escaped (\%).

Here is what I came up with so far:

\\(?:no|)cite\w*\{(.*?)\}

An example to try it out:

blablabla
Author et. al \cite{author92} bla bla. % should match
\citep{author93} % should match
\nocite{author94} % should match
100\%\nocite{author95} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\cite{author97, author98, author99} % should match
\nocite{*} % should not match

Regex101 testing: https://regex101.com/r/ZaI8kG/2/

I appreciate any help.


Solution

  • Use the third-party regex module (pip install regex), which supports the (*SKIP)(*FAIL) verbs that the standard re module lacks, with the following expression:

    (?<!\\)%.+(*SKIP)(*FAIL)|\\(?:no)?citep?\{(?P<author>(?!\*)[^{}]+)\}
    

    See a demo on regex101.com.


    More verbose:

    (?<!\\)%.+(*SKIP)(*FAIL)     # % (not preceded by \) 
                                 # and the whole line shall fail
    |                            # or
    \\(?:no)?citep?              # \nocite, \cite or \citep
    \{                           # { literally
        (?P<author>(?!\*)[^{}]+) # must not start with a star
    \}                           # } literally
    


    If installing another library is not an option, you need to change the expression to

    (?<!\\)%.+
    |
    (\\(?:no)?citep?
    \{
        ((?!\*)[^{}]+)
    \})
    

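    and check programmatically whether the second capture group has been set (that is, is not empty). A commented-out line is consumed by the first alternative, which leaves that group unset. Written across several lines as above, the pattern needs the re.VERBOSE flag; the single-line form works without it.
    In Python, this could look like the following: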

    import re
    
    latex = r"""
    blablabla
    Author et. al \cite{author92} bla bla. % should match
    \citep{author93} % should match
    \nocite{author94} % should match
    100\%\nocite{author95} % should match
    100\% \nocite{author95} % should match
    %\nocite{author96} % should not match
    \cite{author97, author98, author99} % should match
    \nocite{*} % should not match
    """
    
    rx = re.compile(r'''(?<!\\)%.+|(\\(?:no)?citep?\{((?!\*)[^{}]+)\})''')
    
    authors = [m.group(2) for m in rx.finditer(latex) if m.group(2)]
    print(authors)
    

    This yields the following list (the keys of a multi-key citation come back as one comma-separated string):

    ['author92', 'author93', 'author94', 'author95', 'author95', 'author97, author98, author99']
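
Since the question asks for the individual keys, the comma-separated entries can be split in a further step. The following is a minimal sketch of that step, not part of the original answer: it reuses the standard-library pattern from above, and the function name `cited_keys`, the `sample` text, and the choice to drop duplicates while keeping first-appearance order are all assumptions made for illustration.

```python
import re

# Same standard-library pattern as in the answer above.
rx = re.compile(r'(?<!\\)%.+|(\\(?:no)?citep?\{((?!\*)[^{}]+)\})')


def cited_keys(latex):
    """Return the individual citation keys in order of first appearance.

    Hypothetical helper: splitting on commas and dropping duplicates
    are illustrative choices, not requirements from the question.
    """
    keys = []
    for m in rx.finditer(latex):
        if not m.group(2):
            continue  # the first alternative consumed a comment; nothing to collect
        for key in m.group(2).split(','):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys


# Made-up sample text in the style of the question's test lines.
sample = r"""
\cite{author97, author98, author99} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\citep{author97}
"""

print(cited_keys(sample))
# ['author97', 'author98', 'author99', 'author95']
```

The command part here is still the answer's `citep?`. Swapping in `cite\w*` from the question's own attempt would presumably also catch variants such as `\citet`, but that is not checked in this sketch.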