Tags: bash, shell, awk, sed, scripting

Bash script logic to extract text stretched across multiple lines in a file based on delimiters


In my script, I need to store the contents of TEXT('...') from file $CURFILEPATH into a bash variable named $SRCTEXT.

The TEXT('...') parameter appears in various files that contain IBM i CLLE commands.

In CLLE, the + is a continuation character, so ignore that at the end of the line.

The TEXT('...') target might also contain doubled single quotes, like this: TEXT('Bob O''Malley''s favorite DTAARA'). It might also contain other characters such as ( and ).

Here is a straightforward example of a file where the $SRCTEXT to extract is on a single line:

/* Create and set data area for PHP binary location - 1.0.24 */
CRTDTAARA  DTAARA(PHPPATH) TYPE(*CHAR) LEN(255) +
VALUE(' ') TEXT('Path to PHP Binaries')

For that file $SRCTEXT should be "Path to PHP Binaries".

And here is a more difficult example, where the TEXT('...') value stretches across multiple lines via the continuation character +:

/* Create and set data area for Python binary location - 1.05 */
CRTDTAARA  DTAARA(PYPATH) TYPE(*CHAR) LEN(255) +
VALUE('/QOpenSys/pkgs/bin') TEXT('Path to +
Python Binaries')

For that file $SRCTEXT should be "Path to Python Binaries".

An additional edge-case file that uses doubled single quotes ('') and parentheses in the TEXT('...') target:

/* Create and set data area for Python binary location - 1.05 */
CRTDTAARA  DTAARA(PYPATH) TYPE(*CHAR) LEN(255) +
VALUE('/QOpenSys/pkgs/bin') TEXT('Path to +
Python Language''s Binaries (this is an edge case)')

For that file $SRCTEXT should be "Path to Python Language''s Binaries (this is an edge case)".

Note that the quotes should remain doubled.

Though unlikely, the TEXT('...') value could stretch across 3 or more lines with continuation characters. It would be nice to handle that, but a 2-line solution is acceptable.

Any Bash solution using awk, sed, grep, etc. is acceptable.

ChatGPT gave me something like grep -oP "(?<=TEXT ')[^']+" $CURFILEPATH but that wasn't working.


Solution

  • Using GNU awk for multi-char RS and RT (and the \< word boundary):

    $ awk -v RS='\\<TEXT[(]\047(([^\047]|\047\047)+)\047[)]' 'RT{$0=RT; gsub(/^[^\047]+\047|\047[^\047]+$/,""); gsub(/\+\n/,""); gsub(/\047\047/,"\047"); print}' file1
    Path to PHP Binaries
    

    $ awk -v RS='\\<TEXT[(]\047(([^\047]|\047\047)+)\047[)]' 'RT{$0=RT; gsub(/^[^\047]+\047|\047[^\047]+$/,""); gsub(/\+\n/,""); gsub(/\047\047/,"\047"); print}' file2
    Path to Python Binaries
    

    $ awk -v RS='\\<TEXT[(]\047(([^\047]|\047\047)+)\047[)]' 'RT{$0=RT; gsub(/^[^\047]+\047|\047[^\047]+$/,""); gsub(/\+\n/,""); gsub(/\047\047/,"\047"); print}' file3
    This is JDubbTX's Text
    

    $ awk -v RS='\\<TEXT[(]\047(([^\047]|\047\047)+)\047[)]' 'RT{$0=RT; gsub(/^[^\047]+\047|\047[^\047]+$/,""); gsub(/\+\n/,""); gsub(/\047\047/,"\047"); print}' file4
    Path to Python Language's Binaries (this is an edge case)
    

    I just noticed you said:

    Note that the quotes should remain doubled.

    That's not how scripts like this are usually required to work, but it's trivial to do. If you want the doubled single quotes from the input (Language''s) to remain doubled in the output instead of being collapsed to a single quote (Language's), just remove gsub(/\047\047/,"\047"); from the code.

    See https://www.gnu.org/software/gawk/manual/gawk.html#gawk-split-records for information on RS and RT, and http://awk.freeshell.org/PrintASingleQuote for what \047 means.

    To save the output of any of the above in a shell variable you can do:

    $ srctext=$(awk -v RS='\\<TEXT[(]\047(([^\047]|\047\047)+)\047[)]' 'RT{$0=RT; gsub(/^[^\047]+\047|\047[^\047]+$/,""); gsub(/\+\n/,""); gsub(/\047\047/,"\047"); print}' file3)
    $ echo "$srctext"
    This is JDubbTX's Text
    

    just like you'd save the output of any other Unix command. By the way, don't use all-upper-case names for shell variables that aren't exported environment variables; see "Correct Bash and shell script variable capitalization".

    The above was run on these input files:

    $ head file1 file2 file3 file4
    ==> file1 <==
    /* Create and set data area for PHP binary location - 1.0.24 */
    CRTDTAARA  DTAARA(PHPPATH) TYPE(*CHAR) LEN(255) +
    VALUE(' ') TEXT('Path to PHP Binaries')
    For that file $SRCTEXT should be "Path to PHP Binaries".
    
    ==> file2 <==
    /* Create and set data area for Python binary location - 1.05 */
    CRTDTAARA  DTAARA(PYPATH) TYPE(*CHAR) LEN(255) +
    VALUE('/QOpenSys/pkgs/bin') TEXT('Path to +
    Python Binaries')
    
    ==> file3 <==
    CRTDTAARA DTAARA(JWEIRICH1/MYDTA) TYPE(*CHAR) LEN(30) TEXT('This is JDubbTX''s Text')
    
    ==> file4 <==
    /* Create and set data area for Python binary location - 1.05 */
    CRTDTAARA  DTAARA(PYPATH) TYPE(*CHAR) LEN(255) +
    VALUE('/QOpenSys/pkgs/bin') TEXT('Path to +
    Python Language''s Binaries (this is an edge case)')
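If GNU awk is not available, a similar result can probably be had with POSIX awk features only. The sketch below is not the approach from the answer above: it reads the whole file into one string, joins the continuation lines, and then uses match() instead of RS/RT. Treat the following as assumptions of the sketch: the function name extract_text is made up, only the first TEXT('...') in the file is reported, doubled quotes are kept doubled as the question asked, and because POSIX awk has no \< word boundary the pattern would also match a longer keyword that ends in TEXT.

```shell
# Hypothetical helper: print the contents of the first TEXT(...) parameter
# found in a CL source file. Uses POSIX awk only (no RT, no regexp RS).
extract_text() {
    awk '
        { buf = buf $0 "\n" }                # collect the whole file in one string
        END {
            gsub(/[+]\n/, "", buf)           # join continuation lines (any number of them)
            # opening TEXT( and quote, then non-quotes or doubled quotes, then quote and )
            if (match(buf, /TEXT[(]\047([^\047]|\047\047)+\047[)]/)) {
                # skip the 6 opening characters and drop the 2 closing ones
                print substr(buf, RSTART + 6, RLENGTH - 8)
            }
        }
    ' "$1"
}
```

It would be used the same way as the gawk version, for example srctext=$(extract_text "$CURFILEPATH"). Because the continuation lines are joined before matching, a TEXT('...') spread over three or more lines is handled as well.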