
Regex: How to recover Roman-numbered titles and their respective contents


After extracting text from PDF files using pdftotext, I am trying to recover some of their titles and the respective contents.

The files in this batch follow a pattern: a new line, then a Roman numeral optionally followed by a dot or hyphen, then the title, then a line break.

So I tried this pattern:

^[^\S\n]*([CLXVI]{1,7})\.\s?(.*?)\n([\S\s]*)(?=[CLXVI]{1,7})

But it did not work as expected:

https://regex101.com/r/vX4aB4/1

The expected result was something like:

group title -> Breve Síntese da Demanda
group content -> Lorem ipsum dolor ... faucibus.
group title -> Bla Bla bla
group content -> Lorem ipsum dolor ... faucibus.
group title -> Do Mérito
group content -> Lorem ipsum dolor ... commodo.
group title -> Conclusão
group content -> Lorem ipsum dolor ... .

So how can I improve it to properly recover each title and its respective content?


Solution

  • You can use a negative lookahead to keep the content group from running past the next heading, e.g.

    ^(\h*+[CLXVI]{1,7}\.)\h*(.+)\s*((?:(?!(?1)).*\R?)*)
    

    See your updated demo at regex101. Use it in (?m) multiline mode.


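    The relevant part, (?!(?1)), is tested at the start of every content line: (?1) re-runs the pattern of the first group, so the content group stops as soon as the next line begins with a Roman-numeral heading.
    This is a PCRE regex: it uses a subpattern call, (?1), and a possessive quantifier, *+.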
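
If the text is processed outside a PCRE engine, the same idea can be adapted. The following is a minimal sketch for Python, assuming the built-in re module: it has no (?1) subpattern call, \h, or \R, so the heading pattern is spelled out once and reused, [^\S\n] stands in for \h, and \n stands in for \R. The names HEADING, SECTION, split_sections, and the sample text are hypothetical, chosen only for illustration.

```python
import re

# Heading prefix: optional horizontal whitespace, a Roman numeral, a dot.
# This plays the role of group 1 and of the (?1) call in the PCRE version.
HEADING = r"[^\S\n]*[CLXVI]{1,7}\."

SECTION = re.compile(
    r"^(" + HEADING + r")"                 # group 1: numeral with its dot
    r"[^\S\n]*(.+)"                        # group 2: title (rest of the line)
    r"\s*"                                 # blank space before the content
    r"((?:(?!" + HEADING + r").*\n?)*)",   # group 3: lines up to the next heading
    re.MULTILINE,
)


def split_sections(text):
    """Return a list of (title, content) pairs, one per numbered heading."""
    return [
        (m.group(2).strip(), m.group(3).strip())
        for m in SECTION.finditer(text)
    ]


sample = (
    "I. Breve Síntese da Demanda\n"
    "Lorem ipsum dolor.\n"
    "Second line.\n"
    "\n"
    "II. Do Mérito\n"
    "More text here.\n"
)

for title, content in split_sections(sample):
    print(repr(title), "->", repr(content))
```

As in the PCRE pattern, a content line that itself starts with one of the letters C, L, X, V, I followed directly by a dot would be treated as a new heading, so the sketch relies on the input following the layout described in the question.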