
How to select Markdown text under a heading containing a keyword with a regex?


I am trying to match the content under a heading whose title contains a [[Wikilink]] keyword (use case: Obsidian).

The match should include any lower-level headings nested under that heading.

Example: if [[Wikilink]] is in an H2, match everything below it until the next H2 or higher-level heading (H1), or the end of the file.

Difficulty: the heading level of [[Wikilink]] is unknown. The regex should be able to parse multiple inconsistent files where [[Wikilink]] could be in an H1, H2, H3, etc.

My current regex fails because it stops at the next heading of any level:

(^#+ )[^\[]*?\[\[Wikilink]\][^\n]*?\n([\S\s]*?)(?1)

Sandbox: https://regex101.com/r/bLdifP/1

Somewhat related to this question on SO: Regex to match markdown headings and text nested under specific heading
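
To make the requirement concrete, here is a minimal line-by-line sketch of the stopping rule in Python, without a single large regex. The names `HEADING` and `section_under` are hypothetical, and the sketch assumes ATX-style headings (`#` characters followed by a space) and does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import Optional

# A heading line: one or more '#', a space, then the title (assumed ATX style).
HEADING = re.compile(r"^(#+) (.*)$")


def section_under(text: str, keyword: str = "[[Wikilink]]") -> Optional[str]:
    """Return the text under the first heading whose title contains `keyword`.

    The section runs until the next heading of the same or a higher level
    (the same number of '#' or fewer), or until the end of the text.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        m = HEADING.match(line)
        if m and keyword in m.group(2):
            level = len(m.group(1))  # heading level = number of '#'
            body = []
            for later in lines[i + 1:]:
                nxt = HEADING.match(later)
                # Deeper headings (more '#') stay inside the section.
                if nxt and len(nxt.group(1)) <= level:
                    break
                body.append(later)
            return "\n".join(body)
    return None  # no heading contains the keyword
```

Because the level is read from the matched heading itself, the same function works whether [[Wikilink]] sits in an H1, H2, or H3.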


Solution

  • Try the following regex with the multiline flag set. The section body is captured in group 2.

    ^(#+) .*\[\[Wikilink\]\].*$(?=([\S\s]*?)(?:^(?!\1#+)#+ |\z))
    

    Regex in action: https://regex101.com/r/bLdifP/4

    The regex can be broken down as follows.

    ^                        the beginning of a "line"
    --------------------------------------------------------------------
    (                        group and capture to \1:
    --------------------------------------------------------------------
      #+                       '#' (1 or more times (matching the most
                               amount possible))
    --------------------------------------------------------------------
    )                        end of \1
    --------------------------------------------------------------------
                             ' '
    --------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
    --------------------------------------------------------------------
    \[                       '['
    --------------------------------------------------------------------
    \[                       '['
    --------------------------------------------------------------------
    Wikilink                 'Wikilink'
    --------------------------------------------------------------------
    \]                       ']'
    --------------------------------------------------------------------
    \]                       ']'
    --------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
    --------------------------------------------------------------------
    $                        before an optional \n, and the end of a
                             "line"
    --------------------------------------------------------------------
    (?=                      look ahead to see if there is:
    --------------------------------------------------------------------
      (                        group and capture to \2:
    --------------------------------------------------------------------
        [\S\s]*?                 any character of: non-whitespace (all
                                 but \n, \r, \t, \f, and " "),
                                 whitespace (\n, \r, \t, \f, and " ")
                                 (0 or more times (matching the least
                                 amount possible)).
                                 Match any character, including line
                                 breaks.
    --------------------------------------------------------------------
      )                        end of \2
    --------------------------------------------------------------------
      (?:                      group, but do not capture:
    --------------------------------------------------------------------
        ^                        the beginning of a "line"
    --------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
    --------------------------------------------------------------------
          \1                       what was matched by capture \1
    --------------------------------------------------------------------
          #+                       '#' (1 or more times (matching the
                                   most amount possible))
    --------------------------------------------------------------------
        )                        end of look-ahead
    --------------------------------------------------------------------
        #+                       '#' (1 or more times (matching the
                                 most amount possible))
    --------------------------------------------------------------------
                                 ' '
    --------------------------------------------------------------------
      |                        OR
    --------------------------------------------------------------------
        \z                       the end of the string
    --------------------------------------------------------------------
      )                        end of grouping
    --------------------------------------------------------------------
    )                        end of look-ahead
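
For anyone who wants to try the pattern outside regex101, here is a hedged sketch of a Python port. It makes two assumptions about engine differences: the end-of-string anchor is written `\Z`, as Python's `re` module has traditionally spelled it, and `re.MULTILINE` is passed so that `^` and `$` match at line boundaries. The names `SECTION` and `wikilink_section` are hypothetical.

```python
import re

# Port of the pattern above: \z is written \Z, and MULTILINE makes ^ and $
# match at the start and end of each line rather than of the whole string.
SECTION = re.compile(
    r"^(#+) .*\[\[Wikilink\]\].*$(?=([\S\s]*?)(?:^(?!\1#+)#+ |\Z))",
    re.MULTILINE,
)


def wikilink_section(text):
    """Return the body under the first [[Wikilink]] heading, or None."""
    m = SECTION.search(text)
    if m is None:
        return None
    # Group 2 is captured inside the lookahead; trim the surrounding newlines.
    return m.group(2).strip("\n")
```

The match itself covers only the heading line, since the body is consumed inside a lookahead. Read the section text from group 2 rather than from the full match.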