Perl: Help to set up RS


The following script (written partly for educational purposes, which is why it uses not only Perl† but also awk and sed‡) ...

† The Perl version is 5.34.
‡ awk and sed are the ones supplied with macOS.

thisscript input.md output.txt
sed 's/[[:space:]]-[[:space:]]/---/g' $1 |
sed 's/[[:space:]]\{0,1\}—[[:space:]]\{0,1\}/---/g' |
sed 's/\\\*/†/g' |
sed 's/*/\//g' |
sed 's/\\\././g' |
sed 's/…/.../g' |
awk 'BEGIN{RS="";ORS="\n  "}1' |
fold -s -w 72 |
perl -C -lpe '
    BEGIN{ $/="\n  " }
    s/ (\w(?: \w)*)\r?\n/\n$1 /g;
    s/ (\w{2,}(?: \w)+[?.!])\s*$/\n$1/;
    s/ (\w+)\r?\n(?=---)/\n$1/g;
    s/^ */  /
' |
sed 's/[[:space:]]+//g' > $2

is meant to convert typical Markdown text (provided the text is simple enough, say, a children's book about pirates) into something more pleasing to my taste.

Test 1

Input text:

  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.
  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.

Output text:

  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse---minim a b veniam, quis nostrud exercitation ullamco
laboris d.
  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse---minim a b veniam, quis nostrud exercitation ullamco
laboris d.

As you may have noticed, the Perl part is responsible for moving any single-letter word to the next line. There are other things the Perl part does, but for the purposes of this Unix & Linux question, they are irrelevant. We only need to know whether the Perl part works or not.

Test 2

Input text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c quis autem vel eum iure reprehenderit qui in ea voluptate velit esse---minim a b veniam, quis nostrud exercitation ullamco laboris d.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c quis autem vel eum iure reprehenderit qui in ea voluptate velit esse---minim a b veniam, quis nostrud exercitation ullamco laboris d.

The output text should be the same as the output text of the first test, but it is not:

  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c quis
autem vel eum iure reprehenderit qui in ea voluptate velit esse---minim
a b veniam, quis nostrud exercitation ullamco
laboris d.
  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c quis
autem vel eum iure reprehenderit qui in ea voluptate velit esse---minim
a b veniam, quis nostrud exercitation ullamco
laboris d.

As you may have noticed, the single-letter words have not been moved; that is, the Perl part had no effect on the text. As far as I understand (I have just started to learn Perl), this is because the line

BEGIN{ $/="\n  " }

should be adjusted so that it also matches blank-line-separated paragraphs. But my attempts, like this one:

BEGIN{ $/="\n  |\n\n" }

didn't help.

What am I doing wrong?


Solution

  • From perldoc perlvar (cf. https://perldoc.pl/perlvar#$/)

        $/      The input record separator, newline by default. This influences
                Perl's idea of what a "line" is. Works like awk's RS variable,
                including treating empty lines as a terminator if set to the
                null string (an empty line cannot contain any spaces or tabs).
                You may set it to a multi-character string to match a
                multi-character terminator, or to "undef" to read through the
                end of file. Setting it to "\n\n" means something slightly
                different than setting to "", if the file contains consecutive
                empty lines. Setting to "" will treat two or more consecutive
                empty lines as a single empty line. Setting to "\n\n" will
                blindly assume that the next input character belongs to the next
                paragraph, even if it's a newline.
    
                    local $/;           # enable "slurp" mode
                    local $_ = <FH>;    # whole file now here
                    s/\n[ \t]+/ /g;
    
                Remember: the value of $/ is a string, not a regex. awk has to
                be better for something. :-)
    

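    Note the final remark there: the value of $/ is a string, not a regex, so "\n  |\n\n" is looked for literally instead of being treated as an alternation.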