regex, bash, sed, grep, cut

Get list of strings between certain strings in bash


Given a text file (.tex) which may contain strings of the form "\cite{alice}", "\cite{bob}", and so on, I would like to write a bash script that stores the content within brackets of each such string ("alice" and "bob") in a new text file (say, .txt). In the output file I would like to have one line for each such content, and I would also like to avoid repetitions.

Attempts:

  • I thought about combining grep and cut. From other questions and answers that I have seen on Stack Exchange, I think that (modulo reading up on cut a bit more) I could manage to get at least one such content per line, but I do not know how to get all occurrences in a single line if it contains several such strings, and I have not seen any question or answer giving hints in this direction.
  • I have tried using sed as well. Yesterday I read this guide to see if I was missing some basic sed command, but I did not see any straightforward way to do what I want (the guide did mention that sed is Turing complete, so I am sure there is a way to do this only with sed, but I do not see how).

Solution

  • What about:

    grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u > cites.txt
    
    • -P with GNU grep interprets the regexp as a Perl-compatible one (for lookbehind and lookahead groups)
    • -o "prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line" (see manual)
    • The regexp matches one or more characters other than a right curly brace, preceded by \cite{ (positive lookbehind group (?<=\\cite{)) and followed by a right curly brace (positive lookahead group (?=})).
    • sort -u sorts the lines and removes duplicates

    For more details about lookahead and lookbehind groups, see the dedicated page on Regular-Expressions.info.
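
If grep -P is not available (it is a GNU extension), the same result can probably be obtained by combining grep and sed, as the question originally attempted. The following is only a sketch under assumptions: it assumes a grep that supports -o (GNU and BSD grep do, although -o is not required by POSIX), and the function name extract_cites is hypothetical, introduced here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer):
# print the unique \cite keys found in the file given as "$1".
extract_cites() {
    # grep -o prints every \cite{...} match on its own line,
    # even when several of them occur on the same input line.
    # [^}][^}]* means "one or more characters other than }".
    grep -o '\\cite{[^}][^}]*}' "$1" |
        # Strip the leading \cite{ and the trailing } from each match.
        sed -e 's/^\\cite{//' -e 's/}$//' |
        # Sort the keys and drop duplicates.
        sort -u
}

# Usage (file names taken from the answer above):
#   extract_cites sample.tex > cites.txt
```

Instead of lookbehind and lookahead, this version lets grep match the whole \cite{...} string and leaves the trimming to sed, so only basic regular expressions are needed.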