How to merge paragraphs into a single line in Lua?


How can I take a block of text in Lua and merge each paragraph into a single line, the way HTML and Markdown ignore line breaks? Specifically:

  • remove single line breaks
  • keep line breaks when two or more are consecutive
  • always keep header line breaks

My headers are easily identified: their lines never start with a letter character.

I've figured out this pattern: (\n%a.-)\n(%S). But it doesn't merge all the newlines. (I'm using >_< to make it easy to see the merged lines.)

(Note: "procedural" has a trailing space.)

>>> t = [[
                                                               *lrv-section-1*
1 -- Introduction~

Lua is a powerful, efficient, lightweight, embeddable scripting language.
It supports procedural programming,
object-oriented programming, functional programming,
data-driven programming, and data description.

Lua combines simple procedural 
syntax with powerful data description
constructs based on associative arrays and extensible semantics.
Lua is dynamically typed,
runs by interpreting bytecode with a register-based
virtual machine,
and has automatic memory management with
incremental garbage collection,
making it ideal for configuration, scripting,
and rapid prototyping.

]]

>>> print(t:gsub("(\n%a.-)\n(%S)", "%1>_<%2"))
                                                               *lrv-section-1*
1 -- Introduction~

Lua is a powerful, efficient, lightweight, embeddable scripting language.>_<It supports procedural programming,
object-oriented programming, functional programming,>_<data-driven programming, and data description.

Lua combines simple procedural >_<syntax with powerful data description
constructs based on associative arrays and extensible semantics.>_<Lua is dynamically typed,
runs by interpreting bytecode with a register-based>_<virtual machine,
and has automatic memory management with>_<incremental garbage collection,
making it ideal for configuration, scripting,>_<and rapid prototyping.

 7

"constructs" and many others aren't merged to the previous line.

The more naive "(.)\n(%S)" kinda works, but it removes the double line breaks and I'm not sure how I could ensure my section titles maintain their whitespace.

>>> print(t:gsub("(.)\n(%S)", "%1>_<%2"))
                                                               *lrv-section-1*>_<1 -- Introduction~
>_<Lua is a powerful, efficient, lightweight, embeddable scripting language.>_<It supports procedural programming,>_<object-oriented programming, functional programming,>_<data-driven programming, and data description.
>_<Lua combines simple procedural >_<syntax with powerful data description>_<constructs based on associative arrays and extensible semantics.>_<Lua is dynamically typed,>_<runs by interpreting bytecode with a register-based>_<virtual machine,>_<and has automatic memory management with>_<incremental garbage collection,>_<making it ideal for configuration, scripting,>_<and rapid prototyping.

I'm trying to adapt Lua's 2html document processor script to output a Vim help file. I plan to use lume.wordwrap to wrap the lines after merging.


Solution

  • You can use

    (%S)[^%S\n]*\n([%a()])
    

    This pattern matches:

    • (%S) - (captured into Group 1, %1) any non-whitespace character
    • [^%S\n]* - matches (without capturing) zero or more whitespace characters other than a newline; the set excludes both non-whitespace (%S) and \n, so it is %s minus \n
    • \n - a single newline character
    • ([%a()]) - (captured into Group 2, %2) a letter, ( or )
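
The pattern above can be tried with a short sketch like the following. The helper name merge_lines, the sample text, and the "%1 %2" replacement (join with a single space) are assumptions made for illustration; only the pattern itself comes from the answer. Pass ">_<" as the separator to make the joins visible, as in the question.

```lua
-- Pattern from the answer: a non-whitespace character, optional trailing
-- spaces/tabs, exactly one newline, then a letter or a parenthesis.
local PATTERN = "(%S)[^%S\n]*\n([%a()])"

-- Hypothetical helper (the name is ours): joins wrapped lines with `sep`.
-- Returns the merged text and the number of joins, as string.gsub does.
local function merge_lines(text, sep)
  sep = sep or " "
  -- Escape any % in the separator so gsub does not read it as a capture.
  local repl = "%1" .. (sep:gsub("%%", "%%%%")) .. "%2"
  return text:gsub(PATTERN, repl)
end

-- Small sample in the shape of the question's input (made up for this sketch).
local sample = table.concat({
  "    *lrv-section-1*",
  "1 -- Introduction~",
  "",
  "Lua is powerful,",
  "efficient, ",         -- trailing space, like "procedural " in the question
  "and lightweight.",
  "",
  "It is dynamically typed.",
  "",
}, "\n")

local merged, count = merge_lines(sample)
print(merged)
print(count)
```

Blank lines survive because the character after the first \n is another \n, which ([%a()]) cannot match, and lines starting with a digit, space or * are left alone for the same reason. One caveat of this sketch: a header immediately followed by a line that starts with a letter, with no blank line between them, would be joined to it.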