Sentence case in wiki articles using AutoWikiBrowser


I am trying to put the text of wiki articles into sentence case with AutoWikiBrowser (AWB), an automated editor that supports regex find and replace, although not every regex feature is available.

The problem is that wikicode uses many different tags to format text, and there are also templates (inside double curly brackets), images (inside [[File:image.png|Caption]]) and categories (inside [[Category:Category name]]) that should stay untouched. Acronyms should stay in capitals too.

Section titles (inside two to five equal signs) should be put in sentence case, and words in links (inside double square brackets) should be treated as normal text.

I am having trouble because I'm not familiar with positive/negative lookahead/lookbehind, and I cannot find a way to make the regex simple, without having to write all the possible syntaxes.

Also, although AWB can run several regexes in a row, that does not really help here, because, for example, I cannot write a regex that matches words in the main text AND not inside templates (or I didn't find a way to do so).

Note that the case modifier \L doesn't work in AWB, but it can be replaced by {{subst:lg:}}, so don't worry about it and use \L in your examples; I'll adapt the code myself. Some tokens like \h don't work either. Unfortunately, I don't know which regex library AWB uses.

Here is an example of an article I want to edit. I want to match only the words marked Yes:

No Yes no Yes. No, Yes.
== No Yes no ==
==== No Yes ====
[[No Yes]] Yes Yes no no no.
No no [[NO Yes]].
'''[[No Yes]]'''
''[[No Yes]]'' no Yes ''[[Yes Yes]]'' no ''Yes no''.
{{No:No|No No}}
* No Yes.
* '''No Yes'''.
* [[No Yes]].
* '''[[No Yes]]'''.
# No Yes.
#** No Yes.
#: No Yes.
No no no [[File:No.png|No No]] Yes [[Yes Yes]].
[[Category:No No]]
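
Because every word in the sample above is labelled Yes or No, the sample can double as a test fixture for any candidate regex. The following Python sketch is only an illustration and is not part of AWB: the helper name `label_violations` is hypothetical, and it assumes the replacement changes letter case only, so both versions contain the same number of words.

```python
import re

def label_violations(before, after):
    """Compare two versions word by word; return (old, new) pairs that break the labels."""
    words_before = re.findall(r"[A-Za-z]+", before)
    words_after = re.findall(r"[A-Za-z]+", after)
    if len(words_before) != len(words_after):
        raise ValueError("the replacement changed the number of words")
    violations = []
    for old, new in zip(words_before, words_after):
        # Only words spelled exactly "Yes" are expected to change, and only to "yes".
        expected = "yes" if old == "Yes" else old
        if new != expected:
            violations.append((old, new))
    return violations

# Three lines taken from the sample article.
sample = "No Yes no Yes. No, Yes.\n== No Yes no ==\n{{No:No|No No}}"
# What a correct replacement should produce for those lines.
good = "No yes no yes. No, yes.\n== No yes no ==\n{{No:No|No No}}"
# A deliberately wrong result: everything lowercased.
too_eager = sample.lower()

print(label_violations(sample, good))
print(label_violations(sample, too_eager))
```

An empty list means the candidate result respects every label; each reported pair points at a word that was changed when it should not have been, or left alone when it should have been lowercased.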

For example, I tried to use this regex:

Find: (?<!\n|\. |\[\[Category:|\[\[File:)(?<!\{\||\{\{|^\[\[|<!--|^== |^=== |^==== |^===== |^''|^''')(\b[A-Z][a-z]*\b)(?![\w\s]*[\}|}]|[\w\s]*-->)
Replace: \L$1

But it seems overly complicated and it doesn't work the way I'd like it to.

Sorry if this seems complicated, but I have been trying for two days now and I seem to be going in circles...

[EDIT]

The question has been answered below, but I'd like to add a request: can this regex work only outside of multiline comments (inside <!-- / --> tags) and tables (inside {| / |} tags)?

It would be even better if sentence case could be applied to table cells (delimited by pipes and exclamation points), which may include links and/or italic/bold text.

Here is what it would look like:

No Yes

<!-- No No No
No No
No
-->

{| class="wikitable"
|+ No Yes
|-
! '''No Yes''' !! '''No''' !! '''[[No Yes]]'''
|-
| ''No Yes'' || ''[[No Yes Yes]]'' || ''No Yes''
|-
| No Yes || No Yes || [[No Yes Yes]]
|}
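
One possible way to approach the comment and table request outside of AWB is to copy those spans verbatim and run the case change only on the text between them. The Python sketch below illustrates that masking idea under stated assumptions: the function name `apply_outside_protected` is hypothetical, `str.lower` merely stands in for the real sentence-case replacement, tables are assumed to start with {| and end with |} at the beginning of a line, and nested tables are not handled. It leaves table cells untouched, so it does not cover the second part of the request.

```python
import re

# Spans to leave untouched: HTML comments and wiki tables, both possibly multiline.
PROTECTED = re.compile(r"<!--.*?-->|^\{\|.*?^\|\}", re.DOTALL | re.MULTILINE)

def apply_outside_protected(text, transform):
    """Run transform on the text between protected spans only."""
    parts = []
    pos = 0
    for match in PROTECTED.finditer(text):
        # Text before the protected span is transformed...
        parts.append(transform(text[pos:match.start()]))
        # ...and the protected span itself is copied verbatim.
        parts.append(match.group(0))
        pos = match.end()
    # Remaining text after the last protected span.
    parts.append(transform(text[pos:]))
    return "".join(parts)

sample = 'No Yes\n<!-- No No\nNo\n-->\n{| class="wikitable"\n| No Yes\n|}\nNo Yes'
print(apply_outside_protected(sample, str.lower))
```

Note that the transform sees each chunk separately, so a replacement that relies on start-of-line anchors would treat the beginning of every chunk as the beginning of a line.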

Solution

  • As the regex flavor used in AWB turns out to be .NET, you can use a regex with variable-width lookbehind patterns:

    (?m)(?!^)(?<!\.\s+|\[\[(?:Category|File):[^\]\[]*)(?<!\{\||\{\{|^(?:(?:\*\s*)?'?''|\*\s*)?\[\[|<!--|^=+\s*|^#(?::|\*+)?\s*|^\*\s*(?:''')?)\b([A-Z][a-z]*)\b(?![^{}]*}}|[\w\s]*-->)(?<!\[\[(?:Category|File)(?=:[^\]\[]*]]))
    

    See the regex demo. Details:

    • (?m) - multiline mode on
    • (?!^) - not at the start of a line
    • (?<!\.\s+|\[\[(?:Category|File):[^\]\[]*) - immediately before, there should be no . and one or more whitespaces, or [[ followed with Category or File and then : and then zero or more chars other than [ and ]
    • (?<!\{\||\{\{|^(?:(?:\*\s*)?'?''|\*\s*)?\[\[|<!--|^=+\s*|^#(?::|\*+)?\s*|^\*\s*(?:''')?) - negative lookbehind that fails the match if, immediately before, there are patterns like
      • \{\|| - {| string, or
      • \{\{| - {{ string, or
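      • ^(?:(?:\*\s*)?'?''|\*\s*)?\[\[| - at the start of a line, an optional prefix (either an optional * and zero or more whitespaces followed by '' or ''', or a * followed by zero or more whitespaces) and then [[, or
      • <!--|^=+\s*| - a <!-- string, or one or more = chars at the start of a line followed by zero or more whitespaces, or
      • ^#(?::|\*+)?\s*| - a # char at the start of a line, optionally followed by : or one or more * chars, and then zero or more whitespaces, or
      • ^\*\s*(?:''')? - a * char at the start of a line followed by zero or more whitespaces and an optional '''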
    • \b - word boundary
    • ([A-Z][a-z]*) - an uppercase letter followed with zero or more lowercase letters (in .NET, you can also use \p{Lu}\p{Ll}* to match any Unicode letters)
    • \b - word boundary
    • (?![^{}]*}}|[\w\s]*-->) - a negative lookahead: no match allowed if there are zero or more chars other than { and } and then }} or any zero or more word/whitespace chars and then -->
    • (?<!\[\[(?:Category|File)(?=:[^\]\[]*]])) - fail the match if, immediately before, there is [[ followed by Category or File, and right after there is a :, zero or more chars other than [ and ], and then ]].
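
To see the word pattern and the negative lookahead from the list above in isolation, here is a small Python sketch. It is only a fragment of the full pattern: Python's built-in re module does not accept the variable-width lookbehinds, so they are left out, which means links, headings and list items are not handled here. The function name `lowercase_matches` and the sample string are made up for illustration, and the callback plays the role of the \L$1 replacement.

```python
import re

# Fragment of the answer's pattern: "not at line start" + word + negative lookahead.
# The lookbehinds are omitted because Python's re needs fixed-width lookbehinds.
PATTERN = re.compile(r"(?!^)\b([A-Z][a-z]*)\b(?![^{}]*}}|[\w\s]*-->)", re.MULTILINE)

def lowercase_matches(text):
    # Lowercase each matched word; this stands in for the \L$1 replacement.
    return PATTERN.sub(lambda m: m.group(1).lower(), text)

sample = "Intro Word {{Tpl:Name|Some Arg}} Tail Word <!-- Hidden Note -->"
print(lowercase_matches(sample))
```

Words inside the template and the comment are skipped because the lookahead finds a closing }} or --> after them, and the first word is skipped because it sits at the start of the line.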