.net regex webvtt

Extracting from WebVTT using regex


I am trying to build a regex for use in a .NET environment that will let me extract information from WebVTT files.

I want to extract the timecode information and the corresponding text from the following line(s), which may be subtitles or something else. The problem I have run into is that this text is sometimes a single line and at other times spans multiple lines, e.g.:

00:00:02.736 --> 00:00:06.072 line:79.33% position:10.00% align:start 
   AND YOUR GRACE?

00:00:06.072 --> 00:00:08.875 line:74.00% position:10.00% align:start 
  WHAT WILL YOU DO
     ABOUT THAT?

and I need to make sure that I get all of it, without inadvertently running into the start of the next group.

I've tried this:

\n(\d{2}:\d{2}:\d{2}.\d{3})(.|\n)*(?<!\d{2}:\d{2}:\d{2}.\d{3})

The idea was to match the first timecode and everything after it, stopping at the next timecode, but it captures the whole file instead.

I've also tried:

(?<!WEBVTT)(\d{2}:\d{2}:\d{2}.\d{3}).*?(\d{2}:\d{2}:\d{2}.\d{3}).*\n([^\n]+\n)*[^\n]+

I realise that the negative lookbehind at the start is redundant. Here I am trying to put the two timecodes into separate groups, ignore the rest of that line, and then capture everything from the next line on, but this skips subtitle text and does not span multiple lines.

The problem I seem to be having is that I either capture too many lines, or not enough.

Is there a way to tell regex to match something (e.g. the first timecode) and everything after it, then start again when the next timecode is hit?

I'm sure this must be possible, but I am new to regex so I'm finding it difficult. I don't mind if I have to break it up into more than one operation to get the desired result.

So what I am trying to get is along the lines of:

first group either:

00:00:02.736

or

00:00:02.736 --> 00:00:06.072

second (or third depending on the above):

AND YOUR GRACE?

then:

00:00:06.072 --> 00:00:08.875

followed by:

WHAT WILL YOU DO
 ABOUT THAT?

etc


Solution

  • It seems you may use

    (?m)^(\d{2}:\d{2}:\d{2}\.\d+) +--> +(\d{2}:\d{2}:\d{2}\.\d+).*[\r\n]+\s*(?s)((?:(?!\r?\n\r?\n).)*)
    

    See the regex demo

    Details

    • (?m) - MULTILINE mode on
    • ^ - start of a line (due to (?m))
    • (\d{2}:\d{2}:\d{2}\.\d+) - Group 1: a timestamp pattern
    • +--> + - 1+ spaces, -->, 1+ spaces
    • (\d{2}:\d{2}:\d{2}\.\d+) - Group 2: a timestamp pattern
    • .*[\r\n]+\s* - the rest of the line (.*), 1+ linebreak chars ([\r\n]+) and then 0+ whitespaces (\s*)
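    • (?s) - DOTALL mode (RegexOptions.Singleline in .NET) is enabled from this point on, so . also matches newlines
    • ((?:(?!\r?\n\r?\n).)*) - Group 3: any character, 0+ times, consumed one at a time for as long as it does not start a double line break sequence (i.e. the blank line that separates the groups)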
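
As a rough illustration of how the pattern above might be used from C#, here is a minimal sketch. The class name `WebVttCueDemo`, the `ParseCues` helper and the whitespace tidy-up applied to Group 3 are assumptions added for this example and are not part of the original answer. The sample input is the text from the question with LF line endings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names here are illustrative only.
static class WebVttCueDemo
{
    // The pattern from the answer, unchanged.
    static readonly Regex CueRegex = new Regex(
        @"(?m)^(\d{2}:\d{2}:\d{2}\.\d+) +--> +(\d{2}:\d{2}:\d{2}\.\d+).*[\r\n]+\s*(?s)((?:(?!\r?\n\r?\n).)*)");

    // Returns one (start, end, text) tuple per match found in the input.
    public static List<(string Start, string End, string Text)> ParseCues(string vtt)
    {
        var cues = new List<(string Start, string End, string Text)>();
        foreach (Match m in CueRegex.Matches(vtt))
        {
            // Group 3 may span several lines. Collapsing each line break and the
            // indentation around it into one space is an assumption about the
            // desired output, not something the pattern itself does.
            string text = Regex.Replace(m.Groups[3].Value.Trim(), @"\s*\n\s*", " ");
            cues.Add((m.Groups[1].Value, m.Groups[2].Value, text));
        }
        return cues;
    }

    static void Main()
    {
        string vtt =
            "WEBVTT\n\n" +
            "00:00:02.736 --> 00:00:06.072 line:79.33% position:10.00% align:start \n" +
            "   AND YOUR GRACE?\n\n" +
            "00:00:06.072 --> 00:00:08.875 line:74.00% position:10.00% align:start \n" +
            "  WHAT WILL YOU DO\n" +
            "     ABOUT THAT?\n";

        foreach (var cue in ParseCues(vtt))
        {
            Console.WriteLine($"{cue.Start} --> {cue.End} | {cue.Text}");
        }
    }
}
```

With this sample the sketch should print one line per group: `00:00:02.736 --> 00:00:06.072 | AND YOUR GRACE?` followed by `00:00:06.072 --> 00:00:08.875 | WHAT WILL YOU DO ABOUT THAT?`. If the original line breaks inside the text need to be kept, the `Regex.Replace` call can be dropped and `m.Groups[3].Value` used directly.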
