
Moving character sequences to the next line depending on the line's position within a paragraph


Here is an example text:

  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.
  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.

I need to process this text using Awk or maybe Perl so that

  • Rule 1: Each single-letter word that ends a line is moved to the next line, provided that line is not the last line of a paragraph.

  • Rule 2: Otherwise (the line is the last line of a paragraph), the single-letter word is moved to a new line together with the nearest preceding word of at least two letters.

  • Rule 3: Three hyphens at the beginning of a line that is not the first line of a paragraph are treated like the single-letter words in Rule 2: the nearest preceding word of at least two letters is moved down to join them.

That is, the text above should be re-formatted as follows:

  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse---minim a b veniam, quis nostrud exercitation ullamco
laboris d.
  Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse---minim a b veniam, quis nostrud exercitation ullamco
laboris d.

I understand that probably nobody will spend their time writing the whole script for me, but I need at least some entry points to start from. Even a solution that is 50% or 25% workable would help.

To simplify the task, we can assume that paragraphs are separated by a blank line instead of a first-line indent:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.

Solution

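  • With Perl, you can read the data one paragraph at a time. For your sample text, I set the input record separator $/ to a newline followed by two spaces (the first-line indent), so that each paragraph becomes one record, and then apply a few s/pattern/replacement/ substitutions to each paragraph to implement your rules, see below: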

    perl -C -lpe '
        BEGIN{ $/="\n  " }                   # record separator: newline plus the two-space paragraph indent
        s/ (\w(?: \w)*)\r?\n/\n$1 /g;        # rule 1: move 1+ consecutive single-char words at the end of a line to the next line
        s/ (\w{2,}(?: \w)+[?.!])\s*$/\n$1/;  # rule 2: at the end of a paragraph (ending in `.`, `?` or `!`), move the single-char words
                                             #         and the preceding 2+ char word to a new line; trailing whitespace (incl. \r) is dropped
        s/ (\w+)\r?\n(?=---)/\n$1/g;         # rule 3: move the word before a line-initial `---` down to join it
        s/^ */  /                            # restore the leading indent that was consumed as part of the record separator
    ' file
    

    For your sample text, this yields:

      Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
    a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
    b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
    esse---minim a b veniam, quis nostrud exercitation ullamco
    laboris d.
      Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
    a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
    b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
    esse---minim a b veniam, quis nostrud exercitation ullamco
    laboris d.