
Extracting multi line text from a file between delimiters in bash


I'm trying to extract multi-line text from a text file in which values are separated by delimiters, and save it into a string or an array. Most of the values are extracted with awk and saved to variables, but the problem occurs when I need to extract the multi-line description of a specific product into a variable or an array.

The simplified input file syntax looks like this: ID;Name;value1;value2;DESCRIPTION;valueX;valueY;

I'm extracting the first values with awk -F ";" '{print $1}' and assigning them to variables for future manipulation. That works fine, but the problem occurs at the "DESCRIPTION" part, since it's multi-line and contains HTML tags. An example of what the DESCRIPTION looks like:

value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>


<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY

Can you suggest a way of doing this so that I can assign the DESCRIPTION to some kind of variable or array within the bash script and manipulate it further on?


Solution

  • You (originally) asked for an awk-based solution. As others mentioned in the comments, there are better tools for the job. That said, based on sections 4.9 Multiple-Line Records and 4.7 Defining Fields by Content of the gawk manual, you can try something like the following (FPAT is a GNU awk extension, so this needs gawk):

    $ awk --version
    GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
    [...]
    $ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
    
    1. RS = ";\n" assumes that your input file contains multiple ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records and that each record ends with a ; (this is the ; after valueY in your example) followed by a newline.
    2. FPAT = "([^;]+)|(\"<p.+p>\")" is a "best-effort" approach to tell (g)awk what the fields of your records look like. You may need to modify it according to your needs. What it actually says is that there are two field formats (see (...)|(...)). The first field format matches strings that do not contain ; and is used to capture all the fields except DESCRIPTION. The second field format matches strings that start with "<p and end with p>", which is how DESCRIPTION is captured even though it contains ; characters and newlines.

    Against a file with two ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records:

    $ cat testfile 
    ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
    
    
    <strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
    <p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
    
    <p style=""text-align: center;"">
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
    ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
      
    
    <strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
    <p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
    
    <p style=""text-align: center;"">
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
    
    $ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
    NF =  7
    $1 = ID
    $2 = Name
    $3 = value1
    $4 = value2
    $5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
    
    
    <strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
    <p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
    
    <p style=""text-align: center;"">
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
    $6 = valueX
    $7 = valueY
    NF =  7
    $1 = ID
    $2 = Name
    $3 = value1
    $4 = value2
    $5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
      
    
    <strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
    <p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
    
    <p style=""text-align: center;"">
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
    $6 = valueX
    $7 = valueY