Format text with footnote by regular expression

I want to turn the annotations in a text into footnotes. Here is a minimal example of the text.

Paragraph one. This is the first place [1] of paragraph one. This is the second place [2] of paragraph one.

[1] annotation one of paragraph one

[2] annotation two of paragraph one

Paragraph two. This is the first place [1] of paragraph two. This is the second place [2] of paragraph two.

[1] annotation one of paragraph two

[2] annotation two of paragraph two

At the end of each paragraph there are several annotations, numbered starting from [1]. Each annotation forms a paragraph of its own.

What I want to do is insert those annotations into the text using LaTeX syntax. The desired output for the sample text is:

Paragraph one. This is the first place \footnote{annotation one of paragraph one} of paragraph one. This is the second place \footnote{annotation two of paragraph one} of paragraph one.

Paragraph two. This is the first place \footnote{annotation one of paragraph two} of paragraph two. This is the second place \footnote{annotation two of paragraph two} of paragraph two.

This is not just a simple replacement by pattern matching. It may have to be performed paragraph by paragraph. What do you think is the simplest way to do it?

Edit: I have come up with a possible solution that uses sed.

First, remove the newlines in front of the annotations:

Paragraph one. This is the first place [1] of paragraph one. This is the second place [2] of paragraph one. [1] annotation one of paragraph one [2] annotation two of paragraph one

Paragraph two. This is the first place [1] of paragraph two. This is the second place [2] of paragraph two. [1] annotation one of paragraph two [2] annotation two of paragraph two

Then match the pattern

[1] text1 [1] text2 [2]

and replace it with

text2 text1 [2]

Basically, the first [1] is where the annotation should be inserted; the text between the second [1] and [2] is the annotation to be relocated.

These questions are relevant: "Remove new line / line break characters only for specific lines" and "How can I remove a line-feed/newline BEFORE a pattern using sed". However, I can't make that code work for me because of my lack of knowledge of regular expressions.

Solution

Fundamentally, sed is the wrong tool for this job. You might be able to write a sed script that preprocesses the file and generates a new sed script that processes the file, but you're clutching at straws when there are many much better tools for the task. I'd reach for Perl (but I learned Perl over twenty years ago, and Python only a couple of years ago), but Python is also capable of handling it, and with care you could probably even use awk. Part of the trouble is that you have to save all the text of paragraph one until you reach the start of paragraph two; only then can you start generating the actual text for paragraph one.

I think that the 'sed is the wrong tool' comment remains valid even if the sed script captures the contents of the paragraph in the hold space. Those would be lines not starting with a square bracket. The trouble is, when you come to a line with a square bracket, you need to write a regex that substitutes the tail of the line into the hold space in lieu of the contents of the square brackets. That requires a sort of 'dynamic regex'. Even if you knew there'd never be more than, say, 9 footnotes in a paragraph, so you could consider some sort of hack that wrote out the code 9 times, there are still problems writing the replacement strings in the right places.
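To illustrate what the 'dynamic regex' amounts to in a general-purpose language, here is a minimal Python sketch. The helper name apply_note is hypothetical and not part of the Perl script below; the point is only that the pattern is built from the tag at run time, which is the step sed has no convenient way to express.

```python
import re


def apply_note(paragraph: str, tag: str, note: str) -> str:
    """Replace the first "[tag]" in paragraph with a LaTeX footnote."""
    # The pattern is built from the tag at run time; re.escape keeps the
    # square brackets literal instead of starting a character class.
    pattern = re.escape(f"[{tag}]")
    # A function replacement inserts the note verbatim, so backslashes in
    # the note are not interpreted as escapes by re.sub.
    return re.sub(pattern, lambda m: "\\footnote{" + note + "}", paragraph, count=1)


text = "First place [1] and second place [2]."
text = apply_note(text, "1", "note one")
text = apply_note(text, "2", "note two")
print(text)
# prints: First place \footnote{note one} and second place \footnote{note two}.
```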

Here's a simple script in Perl — well, a not incredibly complex script in Perl — that does the job. The 'whirling loops' (three nested loops) make it a little tricky to understand.

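#!/usr/bin/env perl
use strict;
use warnings;

my $para = "";    # current paragraph, saved until its notes have been read

TEXT:
while (<>)
{
    # A note line looks like "[N] text", optionally indented
NOTES:
    while (m/^\s*\[(\d+)]\s+(.*)/)
    {
        my $tag = $1;
        my $note = $2;
        # Replace the first [N] in the saved paragraph with the footnote
        $para =~ s/\[$tag]/\\footnote{$note}/m;
        while (<>)
        {
            last if $_ =~ m/^\s*\[/;    # another note line
            if ($_ !~ m/^\s*$/)
            {
                # First line of the next paragraph: emit the finished one
                print $para;
                $para = "";
                last NOTES;
            }
        }
        # $_ is undefined only when the inner loop ran out of input;
        # testing eof here would skip a note on the file's last line
        last TEXT unless defined $_;
    }

    $para .= $_;
}

print $para;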

Given the input file:

Paragraph one. This is the first place [1] of paragraph one. This is the second place [2] of paragraph one.

[1] annotation one of paragraph one

[2] annotation two of paragraph one

Paragraph two. This is the first place [1] of paragraph two. This is the second place [2] of paragraph two.

[1] annotation one of paragraph two

[2] annotation two of paragraph two

The output of this script from that file is:

Paragraph one. This is the first place \footnote{annotation one of paragraph one} of paragraph one. This is the second place \footnote{annotation two of paragraph one} of paragraph one.

Paragraph two. This is the first place \footnote{annotation one of paragraph two} of paragraph two. This is the second place \footnote{annotation two of paragraph two} of paragraph two.

What does the script do?

The outer loop (labelled TEXT) reads lines into $_ until EOF.

The loop labelled NOTES processes the material after a paragraph up to the start of the next one. It recognizes a footnote line because it starts with a number in square brackets (possibly indented with spaces, and definitely with a space after the close square bracket). When it finds such a line, the number is saved in $tag and the replacement text is saved in $note (it must be a single line; no extended multiline footnotes here). Then the first occurrence of the tag inside square brackets in the saved paragraph is replaced with the footnote notation and the text of the note. This is the part that is nigh-on impossible in a single run of sed, and given that the footnote numbers repeat across paragraphs, it makes even two runs of sed problematic.

Having done that substitution (not caring if there is no match to replace), it reads the next line, and this is where the loops (and the head) start whirling. If the newly read line is a note line, the initial last exits the innermost while and control returns to the next iteration of the NOTES loop. If the line is not blank, we must have just read the first line of the next paragraph, so print the previous paragraph (which now has as many substitutions made as there are substitutions to make), empty the saved paragraph, and exit the NOTES loop. Otherwise, ignore the blank line in the middle of the notes.

After the innermost loop, check whether we got EOF and exit the main loop if we did. Otherwise, add the paragraph line that was just read to the saved paragraph.

At the end, print the last saved paragraph.
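Since Python was mentioned above as a workable alternative, the same steps can be sketched there. This is a loose translation under assumptions, not a line-for-line port: the names NOTE_RE, convert and SAMPLE are made up for illustration, a flag replaces the nested loops, and plain str.replace stands in for the regex substitution.

```python
import re

# A note line is "[N] text", optionally indented (same shape as in the Perl).
NOTE_RE = re.compile(r"^\s*\[(\d+)\]\s+(.*)")


def convert(lines):
    """Yield each paragraph with its notes folded in as \\footnote{...}."""
    para = ""          # paragraph text saved so far
    in_notes = False   # True once a note line has followed the paragraph
    for line in lines:
        match = NOTE_RE.match(line)
        if match:
            tag, note = match.group(1), match.group(2)
            # Replace only the first occurrence of this tag.
            para = para.replace(f"[{tag}]", "\\footnote{" + note + "}", 1)
            in_notes = True
        elif in_notes and line.strip():
            # First line of the next paragraph: emit the finished one.
            yield para
            para = line
            in_notes = False
        elif not in_notes:
            para += line
        # Otherwise: a blank line among the notes, which is dropped.
    yield para


SAMPLE = (
    "Paragraph one. First place [1]. Second place [2].\n"
    "\n"
    "[1] annotation one\n"
    "\n"
    "[2] annotation two\n"
    "\n"
    "Paragraph two. Only place [1].\n"
    "\n"
    "[1] another annotation\n"
)

print("".join(convert(SAMPLE.splitlines(keepends=True))), end="")
```

Like the Perl script, this keeps the blank line that follows each paragraph and only handles single-line notes.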

This has not been exhaustively tested. I've not generated paragraphs with references to missing notes, or notes without references, or notes out of sequence. I think it would 'handle' those by ignoring the issues; there'd still be a reference to the missing note, and unreferenced notes would simply not show up in the output. If the same note number reference appears twice in a paragraph but there's only one note number after the paragraph, the second and subsequent ones are ignored. If the same note number appears twice ('text[1] more[1]') and the notes after the paragraph repeat the number ('[1] note 1A', '[1] note 1B'), then the first will be replaced with 'note 1A' and the second with 'note 1B'. I've not tested multiline paragraphs (but I don't expect trouble). Multiline qualifiers aren't needed for the replacement regex because the reference to a tag cannot be split over lines and isn't anchored on a line.
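The edge cases described above all follow from replacing only the first remaining occurrence of each tag. A small Python sketch makes that visible; fold_notes is a hypothetical helper that takes the paragraph and its notes as (tag, text) pairs, and it models the substitution step only, not the Perl script's input handling.

```python
def fold_notes(paragraph, notes):
    """Apply notes in order, each replacing the first remaining "[tag]"."""
    for tag, text in notes:
        paragraph = paragraph.replace(f"[{tag}]", "\\footnote{" + text + "}", 1)
    return paragraph


# Repeated reference, repeated note number: 1A goes first, 1B second.
print(fold_notes("text[1] more[1]", [("1", "note 1A"), ("1", "note 1B")]))
# prints: text\footnote{note 1A} more\footnote{note 1B}

# Repeated reference, single note: the second reference is left alone.
print(fold_notes("text[1] more[1]", [("1", "note 1A")]))
# prints: text\footnote{note 1A} more[1]

# Reference to a missing note stays; an unreferenced note vanishes.
print(fold_notes("text[1] more[2]", [("1", "note 1"), ("3", "unused")]))
# prints: text\footnote{note 1} more[2]
```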

Processing multiline footnotes is an exercise for the reader (and is not entirely trivial). All else apart, you can't begin substituting a multiline footnote until you find a blank line, another footnote line, or the start of the next paragraph.
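One way the exercise might be approached, sketched in Python rather than Perl. It rests on two assumptions that the original input format does not guarantee: continuation lines follow the first line of a note directly, and a blank line always ends a note (so the next paragraph is always preceded by a blank line). The names convert_multiline and flush are illustrative only.

```python
import re

NOTE_RE = re.compile(r"^\s*\[(\d+)\]\s+(.*)")


def convert_multiline(lines):
    """Fold notes into paragraphs, letting a note span several lines."""
    out = []
    para = ""
    pending = None      # (tag, [pieces]) for the note being collected
    seen_note = False   # True once the current paragraph has had a note

    def flush():
        # Substitute the collected note now that its end is known.
        nonlocal para, pending
        if pending is not None:
            tag, pieces = pending
            body = " ".join(pieces)
            para = para.replace(f"[{tag}]", "\\footnote{" + body + "}", 1)
            pending = None

    for line in lines:
        match = NOTE_RE.match(line)
        if match:                       # another footnote line ends the note
            flush()
            pending = (match.group(1), [match.group(2).strip()])
            seen_note = True
        elif not line.strip():          # a blank line ends the note
            flush()
            if not seen_note:
                para += line
        elif pending is not None:       # continuation of the current note
            pending[1].append(line.strip())
        elif seen_note:                 # first line of the next paragraph
            out.append(para)
            para = line
            seen_note = False
        else:                           # ordinary paragraph line
            para += line
    flush()
    out.append(para)
    return "".join(out)


sample = [
    "Body [1] and [2].\n",
    "\n",
    "[1] first line\n",
    "continues here\n",
    "\n",
    "[2] short\n",
    "\n",
    "Next paragraph.\n",
]
print(convert_multiline(sample), end="")
```

The substitution is deferred to flush because, as noted above, the note's text is not complete until a blank line, another footnote line, or the end of input is seen.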