
How to extract substring between two other substrings?


I have a script that reads a log file line by line. I need to extract the text between two substrings, if they exist in the line my script is currently reading.

For instance, if a line has:

some random text here substring A abc/def/ghi substring B

I need to extract the text abc/def/ghi that is between substring A and substring B by storing it in a variable. How would I go about doing this?

I looked through "Extract substring in Bash" but can't find anything that exactly matches my use case.


Solution

  • Bash provides parameter expansion with substring removal, which lets you trim everything through "substring A" from the front and then trim "substring B" from the back, leaving the text in between, "abc/def/ghi". For example, you can do:

    ssa="substring A"         ## substrings to find text between
    ssb="substring B"
    
    line="some random text here substring A abc/def/ghi substring B"
    
    text="${line#*"${ssa}"}"    ## trim through $ssa from the front (left); inner quotes keep $ssa literal
    text="${text%"${ssb}"*}"    ## trim from $ssb through the end (right)
    
    echo $text                ## output result (unquoted, so word splitting drops the surrounding spaces)
    

    Example Output

    abc/def/ghi
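
    Since the question is about reading a log line by line and extracting only when both substrings are present, here is a minimal sketch of that loop. The helper name extract_between and the sample lines in the here-document are made up for illustration. Unlike the snippet above, it stops at the first occurrence of the end marker after the start marker and trims the surrounding whitespace itself rather than relying on an unquoted echo.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the text between the first occurrence of $2
# and the next occurrence of $3 in string $1; return 1 if either is missing.
extract_between() {
    local line="$1" start="$2" end="$3" rest

    # Both markers must be present, in this order, or there is nothing to extract.
    case $line in
        *"$start"*"$end"*) ;;
        *) return 1 ;;
    esac

    rest=${line#*"$start"}    # drop everything through the first $start
    rest=${rest%%"$end"*}     # drop everything from the first $end onward

    # Trim leading and trailing whitespace left around the extracted text.
    rest=${rest#"${rest%%[![:space:]]*}"}
    rest=${rest%"${rest##*[![:space:]]}"}

    printf '%s\n' "$rest"
}

# Sample input stands in for the log file; lines without both markers are skipped.
while IFS= read -r line; do
    if text=$(extract_between "$line" "substring A" "substring B"); then
        printf '%s\n' "$text"
    fi
done <<'EOF'
some random text here substring A abc/def/ghi substring B
a line with no markers at all
prefix substring A x/y/z substring B trailing text
EOF
```

    With the sample lines shown, this prints abc/def/ghi and x/y/z. To read a real file, replace the here-document with a redirect such as < "$logfile", where logfile is whatever variable holds your path.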
    

    The two basic forms for trimming from the front of a string, and the two for trimming from the back, are:

    ${var#pattern}      # Strip shortest match of pattern from front of $var
    ${var##pattern}     # Strip longest match of pattern from front of $var
    ${var%pattern}      # Strip shortest match of pattern from back of $var
    ${var%%pattern}     # Strip longest match of pattern from back of $var
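
    To see how the four forms differ, it can help to run them against one value. The sketch below uses a made-up path (not from the question) and hypothetical variable names:

```shell
#!/usr/bin/env bash

path="/var/log/app/server.access.log"   # hypothetical sample value

front_short="${path#*/}"     # shortest match of "*/" removed from the front
front_long="${path##*/}"     # longest match of "*/" removed from the front
back_short="${path%.*}"      # shortest match of ".*" removed from the back
back_long="${path%%.*}"      # longest match of ".*" removed from the back

printf '%s\n' "$front_short" "$front_long" "$back_short" "$back_long"
```

    This prints var/log/app/server.access.log, server.access.log, /var/log/app/server.access and /var/log/app/server, one per line.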
    

    Where pattern can contain globbing characters such as '*' and '?'. Look things over and let me know if you have any further questions.
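
    One consequence of the pattern being a glob: if the marker is held in a variable and contains characters such as '*' or '[', an unquoted expansion inside the pattern treats them as wildcards. Quoting the inner variable makes it match literally. A small sketch with made-up input:

```shell
#!/usr/bin/env bash

line="a-b-c then a*c tail"   # hypothetical input
marker="a*c"                 # the * is meant literally here

as_glob="${line#*$marker}"       # unquoted: the * in $marker acts as a wildcard
literal="${line#*"$marker"}"     # quoted: $marker is matched character for character

printf '[%s]\n' "$as_glob" "$literal"
```

    The first result is [ then a*c tail], because the glob a*c already matches the leading "a-b-c". The second is [ tail], because only the literal text "a*c" is accepted as the marker.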

    Using BASH_REMATCH

    BASH_REMATCH is an internal array that holds the results of a [[ text =~ REGEX ]] match. ${BASH_REMATCH[0]} is the entire text matched by REGEX, and ${BASH_REMATCH[1]}, ${BASH_REMATCH[2]}, and so on hold the text matched by each parenthesized capture group (...) in the regular expression, in order. You can use multiple capture groups.
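
    As a quick illustration of how the array is filled when there are several capture groups, here is a sketch using a made-up log line and regular expression (neither is from the question):

```shell
#!/usr/bin/env bash

line="2024-05-01 ERROR disk full"    # hypothetical log line
regex='^([0-9]{4})-([0-9]{2})-([0-9]{2}) ([A-Z]+) (.*)$'

if [[ $line =~ $regex ]]; then
    whole="${BASH_REMATCH[0]}"      # the entire matched text
    year="${BASH_REMATCH[1]}"       # first (...) group
    level="${BASH_REMATCH[4]}"      # fourth (...) group
    message="${BASH_REMATCH[5]}"    # fifth (...) group
    printf '%s\n' "$whole" "$year" "$level" "$message"
fi
```

    Groups are numbered by the position of their opening parenthesis, so index 4 is the log level here and index 5 is the rest of the line.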

    Using the same setup as above, you could modify the script to replace the parameter expansions with:

    regex="^.*${ssa} ([^ ]+) ${ssb}.*$"   ## REGEX to match with (..) capture
    
    [[ $line =~ $regex ]] && echo "${BASH_REMATCH[1]}"
    

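    Here the regular expression in $regex matches the entire line and captures what lies between $ssa and $ssb. Note that ([^ ]+) captures a single run of non-space characters, so the text being extracted must not itself contain spaces. The complete modified script would be: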

    ssa="substring A"         ## substrings to find text between
    ssb="substring B"
    
    line="some random text here substring A abc/def/ghi substring B"
    
    regex="^.*${ssa} ([^ ]+) ${ssb}.*$"   ## REGEX to match with (..) capture
    
    [[ $line =~ $regex ]] && echo "${BASH_REMATCH[1]}"
    

    (same output)
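
    If the text between the two substrings may itself contain spaces, the capture group can be loosened. The following is a sketch of one possible variation, not part of the script above; the sample line is made up. It assumes $ssa and $ssb contain no regular expression metacharacters (otherwise they would need escaping), and because the match is greedy, it captures through the last occurrence of $ssb if there are several.

```shell
#!/usr/bin/env bash

ssa="substring A"
ssb="substring B"
line="some random text here substring A two words here substring B"   # hypothetical line

regex="${ssa} (.*) ${ssb}"    # (.*) also accepts spaces inside the captured text

text=""
if [[ $line =~ $regex ]]; then
    text="${BASH_REMATCH[1]}"
fi
printf '%s\n' "$text"
```

    For the sample line this prints: two words here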

    Both methods are documented in man 1 bash. Use whichever fits the circumstance you are faced with. I have always found parameter expansion a bit more intuitive (and you can incrementally whittle text down to just about anything you need). However, extended regular expression matching is a powerful alternative to the parameter expansions.