
Inline conversion of headers from RTF to HTML with Pandoc


I have an rtf file that, ultimately, I want to convert into a chunked html, splitting on the level 1 headings.

My first step is to convert the rtf to one html file, which is straightforward with:

pandoc -f rtf -t html -o inputfile.html inputfile.rtf

The resulting html file has headings defined by <strong></strong> rather than <h1></h1> so I have to edit the file in a text editor to change all these. Here is a sample from the file:

<p><strong>George Stewart</strong></p>
<p>Title: George Stewart</p>
<p>Type: Task</p>
<p>Date:1734</p>
<p>Description: Christening</p>
<p>Status: +open</p>
<p>Repository: LDS Library</p>
<p>Last action: 8 May 2024</p>
<p><strong>Ann Hill</strong></p>
<p>Title: Ann Hill</p>
<p>Type: Task</p>
<p>Date: 1799</p>
<p>Description: Family</p>
<p>Status: +ToDo</p>
<p>Repository: LDS Library</p>

which has to be edited to:

<p><h1>George Stewart</h1></p>
<p>Title: George Stewart</p>
<p>Type: Task</p>
<p>Date:1734</p>
<p>Description: Christening</p>
<p>Status: +open</p>
<p>Repository: LDS Library</p>
<p>Last action: 8 May 2024</p>
<p><h1>Ann Hill</h1></p>
<p>Title: Ann Hill</p>
<p>Type: Task</p>
<p>Date: 1799</p>
<p>Description: Family</p>
<p>Status: +ToDo</p>
<p>Repository: LDS Library</p>

Then I can run the next step which is to chunk the html into many files splitting at the h1 level with another Pandoc command.

pandoc -t chunkedhtml --split-level=1 -o RN_File inputfile.html

I would like to be able to do that heading conversion inline as part of the Pandoc command. It may be possible with a filter (json/lua?) but I cannot work out the syntax.

Ideally, I would also like to merge the two Pandoc steps, but do not know if this is possible. It seems there might be a method of doing this with a pipe function, but perhaps someone could confirm with an example.

The Pandoc Lua filters guide suggests I need a code block like:

function Strong(elem)
  return pandoc.SmallCaps(elem.content)
end

but I need to match <p><strong> and replace it with <h1>. The following does not work, but may give a clue to what I am trying to achieve:

function Para+Strong(elem)
  return pandoc.Header(1)
end
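For comparison, a filter of this kind is usually written against the outer element (Para) and then inspects its content, because a Lua filter function is named after a single element type. The sketch below is a minimal, untested-here illustration: the file name strong-to-h1.lua and the function name rtf_to_chunks are made up for the example, and it assumes each heading arrives as a paragraph containing exactly one Strong element, as in the sample above.

```shell
#!/bin/bash
# Write a small Lua filter to the current directory (file name is arbitrary).
cat > strong-to-h1.lua <<'EOF'
-- A paragraph whose only content is one Strong run becomes a level-1 header.
function Para(el)
  if #el.content == 1 and el.content[1].t == "Strong" then
    return pandoc.Header(1, el.content[1].content)
  end
  -- Returning nothing leaves every other paragraph unchanged.
end
EOF

# Hypothetical wrapper: convert and chunk in a single pandoc run.
# Usage: rtf_to_chunks inputfile.rtf RN_File
rtf_to_chunks() {
  pandoc -f rtf -t chunkedhtml --split-level=1 \
    --lua-filter=strong-to-h1.lua -o "$2" "$1"
}
```

Because the filter runs on pandoc's internal document, the intermediate HTML file is not needed with this approach. Headers created by a filter carry no identifier unless one is passed as the third argument to pandoc.Header, so the chunk file names may differ from those produced by the two-step route.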

Solution

  • You could use sed on the inputfile.html between the two pandoc commands.

    #!/bin/bash
    
    pandoc -f rtf -t html -o inputfile.html inputfile.rtf
    
    cat inputfile.html | sed 's/<p><strong>\(.*\)<\/strong><\/p>/<h1>\1<\/h1>/g' > inputfile-fixed.html && rm inputfile.html
    
    pandoc -t chunkedhtml --split-level=1 -o RN_File inputfile-fixed.html
    

    Save as: fix_heading.sh
    Change mode executable: chmod +x fix_heading.sh
    Usage: ./fix_heading.sh


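    I used cat with a redirect to a new file as a precaution. If you would rather edit the file in place, replace the cat line with:

    sed -i 's/<p><strong>\(.*\)<\/strong><\/p>/<h1>\1<\/h1>/g' inputfile.html
    

    That eliminates the need for the intermediate file; the second pandoc command must then read inputfile.html instead of inputfile-fixed.html. Note that -i without an argument is GNU sed syntax; BSD/macOS sed expects -i ''.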
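The same substitution can also sit in a pipe between the two pandoc calls, since pandoc reads standard input when no input file is given. This is only a sketch: the function names fix_headings and rtf_to_chunks_sed are invented for the example, and it assumes each heading paragraph is emitted on a single line, as in the sample output.

```shell
#!/bin/bash
# Rewrite "<p><strong>X</strong></p>" as "<h1>X</h1>"; reads stdin, writes stdout.
# [^<]* (rather than .*) stops the match from running across other tags.
fix_headings() {
  sed 's|<p><strong>\([^<]*\)</strong></p>|<h1>\1</h1>|g'
}

# Hypothetical wrapper: RTF in, chunked HTML out, no temporary files.
# Usage: rtf_to_chunks_sed inputfile.rtf RN_File
rtf_to_chunks_sed() {
  pandoc -f rtf -t html "$1" | fix_headings |
    pandoc -f html -t chunkedhtml --split-level=1 -o "$2"
}
```

Keeping the substitution in its own function makes it easy to check on a few sample lines before running the full conversion.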