regex, notepad++, pcre, emeditor

Match strings between specific tags and convert to wikilinks


This question is similar to another one I asked here: Match strings between delimiting characters, but I could not adapt that solution to the new task. (The solution should work in EmEditor or Notepad++.)

I need to match text between specific tags, e.g. <b class="b2">I have a lot of text, more text, some more text, text</b>, and then:

  1. Convert only the first character after the opening tag to lowercase (except for the pronoun "I").
  2. Convert each comma-separated item inside the tag to a wikilink (and remove the tags).

I have tried to get close to this by running several regexes in sequence (each find pattern is followed by its replacement), i.e.

(<b class="b2">)(.)
[[\L\2

</b>
]]

(\[\[)(\w+), (\w+)(\]\])
\1\2]], [[\3\4

Input text:

Any text <b class="b2">I make laugh</b>: Ar. and P. γέλωτα. Some more text <b class="b2">Delight</b>: P. and V. [[τέρπω]].
Any text <b class="b2">I amuse oneself, pass the time</b>: P. διάγειν.
Any text <b class="b2">It amuses oneself with, pass the time over, amuse</b>: Ar. and P.

Expected output:

Any text [[I make laugh]]: Ar. and P. γέλωτα. Some more text [[delight]]: P. and V. [[τέρπω]].
Any text [[I amuse oneself]], [[pass the time]]: P. διάγειν.
Any text [[it amuses oneself with]], [[pass the time over]], [[amuse]]: Ar. and P.

Solution

  • This is a one-step solution:

    • Ctrl+H
    • Find what: (?:<b class="b2">|\G(, (?=.*</b>)))(I )?([^,<]+)(?:</b>)?
    • Replace with: $1[[$2\l$3]]
    • check Wrap around
    • check Regular expression
    • UNCHECK . matches newline
    • Replace all

    Explanation:

    (?:                 # non-capturing group
        <b class="b2">  # the opening tag, literally
      |                 # OR
        \G              # continue from the end of the previous match
        (               # start of group 1
          ,             # a comma and a space
          (?=.*</b>)    # positive lookahead: a closing tag must follow on the same line
        )               # end of group 1
    )                   # end of non-capturing group
    (I )?               # group 2, optional: uppercase "I" and a space
    ([^,<]+)            # group 3: one or more characters other than a comma or "<"
    (?:</b>)?           # optional closing tag
    

    Replacement:

    $1          # content of group 1 (i.e. comma & space, empty for the first item)
    [[          # double opening square bracket
    $2          # content of group 2 (i.e. "I ", if present)
    \l$3        # lowercase the first letter of group 3 (i.e. all characters up to the comma or end tag)
    ]]          # double closing square bracket
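
For readers who want to cross-check the transformation outside the editor, here is a minimal Python sketch of the same idea. It is not the regex above: Python's `re` module has neither `\G` nor the `\l` case operator, so this version swaps in a replacement callback that splits the tag content on ", ". The names `TAG`, `lower_first` and `to_wikilinks` are made up for illustration, and edge cases (for example a comma with no following space) may behave differently from the Notepad++ pattern.

```python
import re

# One <b class="b2">...</b> span whose content contains no "<".
TAG = re.compile(r'<b class="b2">([^<]+)</b>')


def lower_first(item):
    """Lowercase the first character, unless the item starts with the pronoun "I "."""
    if item.startswith("I "):
        return item
    return item[:1].lower() + item[1:]


def to_wikilinks(text):
    """Replace each tagged span with one [[wikilink]] per comma-separated item."""
    def repl(match):
        items = match.group(1).split(", ")
        return ", ".join("[[" + lower_first(item) + "]]" for item in items)

    return TAG.sub(repl, text)


if __name__ == "__main__":
    sample = 'Any text <b class="b2">It amuses oneself with, pass the time over, amuse</b>: Ar. and P.'
    print(to_wikilinks(sample))
    # prints: Any text [[it amuses oneself with]], [[pass the time over]], [[amuse]]: Ar. and P.
```

Like the replacement above, the sketch applies the "I " exception and the lowercasing to every item, not only to the first one after the opening tag.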
    

    Result for given example:

    Any text [[I make laugh]]: Ar. and P. γέλωτα. Some more text [[delight]]: P. and V. [[τέρπω]].
    Any text [[I amuse oneself]], [[pass the time]]: P. διάγειν.
    Any text [[it amuses oneself with]], [[pass the time over]], [[amuse]]: Ar. and P.
    [[be at ease]], v.: P. and V. ἡσυχάζειν, V. ἡσύχως ἔχειν.
    
