Search code examples
regexpcreheredoc

Catch token word in bash 'heredoc'


The beginning of a 'heredoc' string in bash usually looks like

cat <<EOF or cat << EOF

i.e. there may or may not be a space between the two less-than characters and the token word "EOF". I would like to catch the token word, so I try the following

$ pcretest
PCRE version 8.45 2021-06-15

  re> "^\s*cat.*[^<]<{2}[^<](.*)"
data> cat << EOF
 0: cat << EOF
 1: EOF
data> cat <<EOF
 0: cat <<EOF
 1: OF

As you can see in the string where there is no space between the << and EOF, I only catch "OF" and not "EOF". The expression must match exactly two less-than signs and fail if there are three or more. But why does it gobble up the 'E' so that only "OF" is returned?


Solution

  • In your pattern using are using a negated character class [^<] which matches a single character other than < which in this case is the E char in the string <<EOF

    For your examples and using pcre, you could match the leading spaces and then match << without a following <

    ^\h*cat\h+<<(?!<)(.*)
    

    The pattern matches:

    • ^ Start of string
    • \h* Match optional horizontal whitespace chars
    • cat\h+ Match cat and 1+ horizontal whitespace chars
    • <<(?!<) Match << and assert not < directly to the right
    • (.*) Capture optional chars in group 1

    See a regex demo