Consider the following examples of SRT subtitle text:
1
00:00:08,181 --> 00:00:10,461
FOOTSTEPS
2
00:00:12,861 --> 00:00:17,901
<font size="36">This programme contains some scenes
which some viewers may find
upsetting from the start.</font>
...
56
00:06:01,061 --> 00:06:02,741
<font color="#ffff00">and talk shop for a change.</font>
57
00:06:02,741 --> 00:06:05,381
<font size="36">Look, I like it here, you know?</font>
58
00:06:05,381 --> 00:06:07,701
<font size="36">I want to help.
LOUD CLATTERING</font>
59
00:06:07,701 --> 00:06:09,661
<font size="36">BABY CRIES</font>
60
00:06:11,021 --> 00:06:12,981
<font color="#00ffff">Sh, sh.</font>
61
00:06:25,621 --> 00:06:27,101
<font color="#ffff00">There's that look.</font>
...
112
00:09:46,741 --> 00:09:48,501
<font size="36">OK. Where is he?</font>
...
501
00:52:04,701 --> 00:52:06,141
<font size="36">ALICE, BILL AND BEN GROAN</font>
502
00:52:07,621 --> 00:52:09,981
<font size="36">BILL: I'm looking for Karveel Street?
TOM CRIEGHTON-HUGHES CHUCKLES</font>
I am parenthesising the "meta" speech acts, represented above as CAPITALISED text. The end result will look like this (note: I'm not asking how to put parens in place, just how to capture the capitalised text as described below):
1
00:00:08,181 --> 00:00:10,461
(FOOTSTEPS)
58
00:06:05,381 --> 00:06:07,701
<font size="36">I want to help.
(LOUD CLATTERING)</font>
59
00:06:07,701 --> 00:06:09,661
<font size="36">(BABY CRIES)</font>
501
00:52:04,701 --> 00:52:06,141
<font size="36">(ALICE, BILL AND BEN GROAN)</font>
502
00:52:07,621 --> 00:52:09,981
<font size="36">BILL: I'm looking for Karveel Street?
(TOM CRIEGHTON-HUGHES CHUCKLES)</font>
To keep from matching unwanted CAPS, like "I" or acronyms in the text, it's worth noting that the text of interest always occupies the ENTIRE LINE (possibly with tags but no other speech text). Far as I can see, the only cases of interest are:
^ (at start of line)
\R (following a line return)
> (following a tag)
All terminate either with the EOL character ($) or the text "</font>"
First effort
Based on this answer for SublimeText and this one for Perl I tried these regexes:
(?<=^|\R|>)(\-\s*)*([A-Z\h\,\-]+(\R[A-Z\h\,\-]+)?)(?=$|</font>)
(?:(?<=^)|(?<=\R)|(?<=\>))(\-\s*)*([A-Z\h\,\-]+(\R[A-Z\h\,\-]+)?)(?=$|</font>)
but Notepad++ says "Invalid Regular Expression". Maybe disjunction isn't allowed in a lookbehind? Though the positive lookahead (?=$|</font>) also uses disjunction and it does work.
Another post recommended that I put each alternative in its own lookbehind, like this:
(?:(?<=^)|(?<=\R)|(?<=\>))(\-\s*)*([A-Z\h\,\-]+(\R[A-Z\h\,\-]+)?)(?=$|</font>)
but Notepad++ says "Invalid Regular Expression".
My workaround regex does the job:
((?:^|\R|>)\-?\s*)([A-Z\h\,\-]+(\R[A-Z\h\,\-]+)?)(?=$|</font>)
but I want a better solution than this.
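For anyone who wants to try the workaround outside Notepad++, here is a rough Python sketch of it. It is a translation, not the Notepad++ regex itself: Python's re module has no \R or \h, so I am assuming \n line endings and replacing \h with a space or tab. The helper name parenthesise is made up for illustration.

```python
import re

# Python translation of the workaround (an assumption, not the Notepad++ original):
# \R becomes \n and \h becomes a literal space or tab.
WORKAROUND = re.compile(
    r"((?:^|\n|>)-?\s*)([A-Z \t,\-]+(?:\n[A-Z \t,\-]+)?)(?=$|</font>)",
    re.MULTILINE,
)

def parenthesise(srt_text: str) -> str:
    # Group 1 is the consumed prefix (line break or ">"), so it has to be put back;
    # group 2 is the capitalised caption that gets wrapped.
    return WORKAROUND.sub(r"\1(\2)", srt_text)

sample = (
    "58\n"
    "00:06:05,381 --> 00:06:07,701\n"
    '<font size="36">I want to help.\n'
    "LOUD CLATTERING</font>\n"
    "59\n"
    "00:06:07,701 --> 00:06:09,661\n"
    '<font size="36">BABY CRIES</font>\n'
)
result = parenthesise(sample)
print(result)
```

The need to re-emit group 1 in the replacement is exactly what a lookbehind would avoid, which is why I would prefer one.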
Is there any way I can use disjunction in a lookbehind in Notepad++ to achieve the result I need?
I hope someone can shed some light on what's going on. I'm using Notepad++ v8.7.5.
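The issue is that lookbehinds in Notepad++ must be fixed-width. Since ^ is 0-width while > has 1-width, combining them in a single lookbehind throws an error. Note that \R is not 0-width: it consumes the line break, which can be one character (\n) or two (\r\n), so as far as I can tell it is not fixed-width even in a lookbehind of its own. I have verified that both alternatives below work in Notepad++.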
Second observation (and if I am thinking wrongly about this I would like to know): in a lookbehind, ^ and \R both check for the same thing, so I chose to keep only ^.
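The fixed-width restriction is not unique to Notepad++. As a hedged illustration, Python's re module (a different engine from the one in Notepad++, so this only shows the same kind of rule, not the same implementation) rejects a lookbehind whose branches differ in width but accepts one lookbehind per branch. The helper name compiles is made up for this sketch.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern, re.MULTILINE)
        return True
    except re.error:
        return False

# One lookbehind whose branches have widths 0 (^) and 1 (>).
single = compiles(r"(?<=^|>)[A-Z ]+")
# One fixed-width lookbehind per branch, combined by an outer alternation.
split = compiles(r"(?:(?<=^)|(?<=>))[A-Z ]+")

print(single, split)
```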
I propose two alternatives:
((?<=^)|(?<=>))[A-Z\,\- ]+(?=$|</font>)
( ... | ... ): the alternation of possible lookbehinds, which assert that the text is either
(?<=^): behind the start of a line, or
(?<=>): behind a literal ">".

(?<=(?=[\s\S]^|>)[\s\S])[A-Z\,\- ]+(?=$|</font>)
(?<= ... [\s\S]): look behind one character
(?= ... | ... ): and assert that that character is either
[\s\S]^: any character followed by the start of a line, or
>: a literal ">".
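To see what the first alternative captures, here is a small Python sketch run against a few of the subtitle entries from the question. Python's re is not the Notepad++ engine, so treat this as an approximation; the only change I made is turning the outer group into a non-capturing one so that the whole match is the caption.

```python
import re

# First alternative, with the lookbehind group made non-capturing.
CAPTION = re.compile(r"(?:(?<=^)|(?<=>))[A-Z\,\- ]+(?=$|</font>)", re.MULTILINE)

sample = "\n".join([
    "1",
    "00:00:08,181 --> 00:00:10,461",
    "FOOTSTEPS",
    "58",
    "00:06:05,381 --> 00:06:07,701",
    '<font size="36">I want to help.',
    "LOUD CLATTERING</font>",
    "501",
    "00:52:04,701 --> 00:52:06,141",
    '<font size="36">ALICE, BILL AND BEN GROAN</font>',
    "502",
    "00:52:07,621 --> 00:52:09,981",
    "<font size=\"36\">BILL: I'm looking for Karveel Street?",
    "TOM CRIEGHTON-HUGHES CHUCKLES</font>",
])

# Collect the full match of every hit; dialogue lines and timecodes are skipped.
captions = [m.group(0) for m in CAPTION.finditer(sample)]
print(captions)
```

Because nothing is consumed before the caption, a replacement only has to wrap the match itself.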