Tags: awk, replace, moodle, gift

Using awk to process html-related Gift-format Moodle questions


This is basically an awk question, but it is about processing data for the Moodle GIFT format, hence the tags.

I want to show HTML code in a question (Moodle "test" activity), but I need to replace < and > with the corresponding entities; otherwise they are interpreted as "real" HTML and not printed. However, I want to be able to type the question with regular code and post-process the file before importing it as GIFT into Moodle.

I thought awk would be the perfect tool to do this.

Say I have this (invalid as such) Moodle/gift question:

::q1::[html]This is a question about HTML:
<pre>
<p>some text</p>
</pre>
and some tag:<code><img></code>
{T}

What I want is a script that translates this into a valid gift question:

::q1::[html]This is a question about HTML:
<pre>
&lt;p&gt;some text&lt;/p&gt;
</pre>
and some tag:<code>&lt;img&gt;</code>
{T}

Key point: replace < and > with &lt; and &gt; when they appear:

  1. inside a <pre>...</pre> block (assuming those tags are alone on a line)
  2. between <code> and </code>, with an arbitrary string in between.

For the first part, I'm fine. I have a shell script calling awk (gawk, actually).

awk -f process_src2gift.awk "$1.src" > "$1.gift"

with process_src2gift.awk:

BEGIN { print "// THIS IS A GENERATED FILE !" }
{
    if( $1=="<pre>" ) # opening a "code" block
    {
        code=1;
        print $0;
    }
    else
    {
        if( $1=="</pre>" ) # closing a "code" block
        {
            code=0;
            print $0;
        }
        else
        { # if "code block", replace < > by html entities
            if( code==1 )
            {
                gsub(">","\\&gt;");
                gsub("<","\\&lt;");
            }
            print $0;
        }
    }
}
END { print "// END" }
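
A note on the replacement strings in the script above: in awk, an unescaped & in the replacement of gsub stands for the matched text, which is why the entities are written as "\\&gt;" and "\\&lt;". The following is a minimal sketch of the difference; the helper names naive_escape and safe_escape are hypothetical and exist only for this illustration.

```shell
# Hypothetical helpers, for illustration only.

# Unescaped "&" in the replacement is replaced by the matched text (">").
naive_escape() { awk '{ gsub(">", "&gt;"); print }'; }

# "\\&" in an awk string literal yields a literal ampersand in the output.
safe_escape() { awk '{ gsub(">", "\\&gt;"); print }'; }

printf '<p>\n' | naive_escape   # prints: <p>gt;
printf '<p>\n' | safe_escape    # prints: <p&gt;
```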

However, I'm stuck with the second requirement.

Questions:

  1. Is it possible to extend my awk script so that it also processes the HTML code inside the <code> tags? Any ideas? I thought about using sed, but I didn't see how to do that.

  2. Maybe awk isn't the right tool for this? I'm open to suggestions for other (standard Linux) tools.


Solution

  • Answering my own question.

    I found a solution using a two-step awk process:

    • the first step is the script described in the question
    • the second step defines <code> or </code> as the field separator, using a regex, and applies the string replacement to the second field ($2).
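
To make the second step concrete, here is a small sketch of how a regex field separator splits a line around the two tags. The pattern is written here as `</?code>`, which matches the same two tags as the separator used in the script; the helper name show_fields is hypothetical.

```shell
# Hypothetical helper: print NF and the first three fields, separated by "|",
# after splitting each line on <code> or </code>.
show_fields() {
    awk -F '</?code>' '{ printf "NF=%d|%s|%s|%s\n", NF, $1, $2, $3 }'
}

printf 'and some tag:<code><img></code>\n' | show_fields
# prints: NF=3|and some tag:|<img>|
```

The text between the tags ends up in $2, and the trailing </code> produces an empty third field, which is why a line with one complete pair has NF equal to 3.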

    The shell file becomes:

    echo "Step 1"
    awk -f process_src2gift.awk "$1.src" > "$1.tmp"
    
    echo "Step 2"
    awk -f process_src2gift_2.awk "$1.tmp" > "$1.gift"
    
    rm "$1.tmp"
    

    And the second awk file (process_src2gift_2.awk) will be:

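    BEGIN { FS = "</?code>" }   # split each line on <code> or </code>
    {
        if( NF >= 3 )
        {
            # only touch the text between <code> and </code>;
            # assigning to $2 on other lines would rebuild $0 and drop the tag
            gsub(">","\\&gt;",$2);
            gsub("<","\\&lt;",$2);
            print $1 "<code>" $2 "</code>" $3
        }
        else
            print $0
    }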
    

    Of course, there are limitations:

    • no attributes in the <code> tag
    • only one pair <code></code> in the line
    • probably others...
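
As a possible way around the first two limitations, both steps could be merged into a single pass that loops over every <code>...</code> pair on a line and tolerates attributes in the opening tag. This is only a sketch under assumptions: the function names src2gift and esc are hypothetical, <pre> and </pre> are still expected alone on their line, a <code> pair must open and close on the same line, and the generated header and footer comments of the original script are left out.

```shell
# Hypothetical wrapper: reads a .src file (or stdin) and writes GIFT text to stdout.
src2gift() {
    awk '
    # Replace < and > with their HTML entities in a copy of the string.
    function esc(s) {
        gsub(/</, "\\&lt;", s)
        gsub(/>/, "\\&gt;", s)
        return s
    }

    $1 == "<pre>"  { inpre = 1; print; next }   # opening a code block
    $1 == "</pre>" { inpre = 0; print; next }   # closing a code block
    inpre          { print esc($0); next }      # inside the block: escape the whole line

    {
        out = ""; rest = $0
        # Handle every <code ...>...</code> pair found on the line.
        while (match(rest, /<code( [^>]*)?>/)) {
            open_end = RSTART + RLENGTH          # first character after the opening tag
            tail = substr(rest, open_end)
            close_at = index(tail, "</code>")
            if (close_at == 0) break             # no closing tag: leave the rest untouched
            out = out substr(rest, 1, open_end - 1) esc(substr(tail, 1, close_at - 1)) "</code>"
            rest = substr(tail, close_at + 7)    # continue after </code>
        }
        print out rest
    }
    ' "$@"
}

printf '%s\n' 'x <code><a></code> y <code class="k"><b></code>' | src2gift
# prints: x <code>&lt;a&gt;</code> y <code class="k">&lt;b&gt;</code>
```

With this, the shell script would reduce to a single call such as src2gift "$1.src" > "$1.gift", with no temporary file.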