
PHP Regex for .vtt files

I am looking to loop through existing .vtt files and read the cue data into a database.

The format of the .vtt files is:


00:00:00.000 --> 00:00:10.000

00:00:10.000 --> 00:00:20.000
Other stuff
Example with 2 lines

00:00:20.00 --> 00:00:30.000
Example with only 2 digits in milliseconds

00:00:30.000 --> 00:00:40.000
Different stuff

00:00:40.000 --> 00:00:50.000
Example without a head line

Originally I tried to be strict about line boundaries by using ^ and $, along the lines of: /^(\w*)$^(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})$^(.+)$/ims. I could not get this working in the regex checker, so I resorted to using \s to deal with line starts and ends.

Currently I am using the following regex: /(.*)\s(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})\s(.+)/im

This partially works in online regex checkers. It does not pick up multi-line subtitles, but it does get the first line, which is good enough for my purpose at this point because all subtitles are currently one-liners. However, if I put this into PHP (preg_match_all("/(.*)\s(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})\s(.+)/mi", $fileData, $matches)) and dump the results, I get an array of empty arrays.

What might be different between the online regex checker and PHP?

Thanks in advance for any suggestions.

EDIT--- Below is a dump of $fileData and a dump of $matches:

string(341) "WEBVTT FILE

00:00:00.000 --> 00:00:10.000

00:00:10.000 --> 00:00:20.000
Other stuff
Example with 2 lines

00:00:20.00 --> 00:00:30.000
Example with only 2 digits in milliseconds

00:00:30.000 --> 00:00:40.000
Different stuff

00:00:40.000 --> 00:00:50.000
Example without a head line"

array(11) {
        array(0) {}
        array(0) {}
        array(0) {}
        array(0) {}
        array(0) {}
        array(0) {}
        array(0) {}
        array(0) {}
        array(0) {}
        array(0) {}
        array(0) {}
}


  • The problem with your regular expression is poor line-ending handling.

    You have this at the end: \s(.+)/mi.
    \s matches exactly one whitespace character, but a newline can be one or two characters long (\n or \r\n).

    To fix it, you can use \R(.+)/mi.
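
    A minimal sketch of the difference, assuming a file with Windows-style line endings. The one-cue sample string and the names $cue and $ts are made up for illustration; only the \s versus \R contrast comes from the explanation above.

```php
<?php
// Hypothetical one-cue sample with Windows-style (\r\n) line endings.
$cue = "00:00:10.000 --> 00:00:20.000\r\nOther stuff\r\n";

// Timestamp sub-pattern taken from the question's regex (four groups).
$ts = '(\d{2}):(\d{2}):(\d{2})\.(\d{2,3})';

// \s consumes only the \r, so (.+) is left facing \n, which "." does not match.
$withS = preg_match('/' . $ts . ' --> ' . $ts . '\s(.+)/mi', $cue, $m1);

// \R consumes the whole \r\n pair, so (.+) can capture the cue text.
$withR = preg_match('/' . $ts . ' --> ' . $ts . '\R(.+)/mi', $cue, $m2);

var_dump($withS); // int(0)
var_dump($withR); // int(1)

// "." does match \r, so the captured text can end in a stray \r; trim it.
$text = rtrim($m2[9], "\r");
var_dump($text);  // string(11) "Other stuff"
```

    Note that the text captured after \R may still carry a trailing \r, which is why the sketch trims it.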

    It works on the website because the site normalizes your newlines into Linux-style newlines.
    That is, Windows-style newlines are \r\n (2 characters) and Linux-style are \n (1 character).
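
    Another option, not part of the fix above, is to do the same normalization in PHP before matching, so that the question's original \s-based pattern sees only \n. This is a sketch under that assumption; the short $fileData sample here is a made-up stand-in for the real file contents.

```php
<?php
// Hypothetical stand-in for the real $fileData, with Windows-style line endings.
$fileData = "WEBVTT FILE\r\n\r\n00:00:30.000 --> 00:00:40.000\r\nDifferent stuff\r\n";

// Convert \r\n (and any lone \r) to \n, as the online checker appears to do.
$normalized = str_replace(["\r\n", "\r"], "\n", $fileData);

// The pattern from the question, unchanged.
$pattern = '/(.*)\s(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})\s(.+)/mi';

$before = preg_match_all($pattern, $fileData, $m1);
$after  = preg_match_all($pattern, $normalized, $m2);

var_dump($before);    // int(0)
var_dump($after);     // int(1)
var_dump($m2[10][0]); // string(15) "Different stuff"
```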

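    Alternatively, you can try this regular expression:

    /(?:line(\d+)\R)?(\d{2}(?::\d{2}){2}\.\d{2,3})\s*-->\s*(\d{2}(?::\d{2}){2}\.\d{2,3})\R((?:[^\r\n]|\r?\n[^\r\n])*)(?:\r?\n\r?\n|$)/i
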
    It looks horrible, but it works.
    Note: I'm swapping between \R and \r\n because \R is not treated as a newline escape inside a character class ([]).

    The data is captured like this:

    1. Line number (if present)
    2. Initial timestamp
    3. Final timestamp
    4. Multiline text
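
    As a rough sketch of how those four groups could be turned into rows for a database, assuming the pattern above and PREG_SET_ORDER: the sample text, the $rows layout, and its key names are hypothetical, and the actual database insert is left out.

```php
<?php
// Pattern copied from the answer; the sample text below is made up for illustration.
$re = '/(?:line(\d+)\R)?(\d{2}(?::\d{2}){2}\.\d{2,3})\s*-->\s*(\d{2}(?::\d{2}){2}\.\d{2,3})\R((?:[^\r\n]|\r?\n[^\r\n])*)(?:\r?\n\r?\n|$)/i';

$vtt = "WEBVTT FILE\r\n\r\n"
     . "line1\r\n00:00:10.000 --> 00:00:20.000\r\nOther stuff\r\nExample with 2 lines\r\n\r\n"
     . "00:00:20.00 --> 00:00:30.000\r\nExample with only 2 digits in milliseconds";

// PREG_SET_ORDER gives one array per cue, with the groups in order.
preg_match_all($re, $vtt, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    $rows[] = [
        // Group 1 is an empty string when the cue has no "lineN" label.
        'line'  => $m[1] === '' ? null : (int) $m[1],
        'start' => $m[2],
        'end'   => $m[3],
        // Split the multiline text on any newline style.
        'text'  => preg_split('/\R/', $m[4]),
    ];
}

print_r($rows);
```

    Each entry in $rows then holds one cue, ready to be bound to whatever insert statement the database layer uses.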

    You can try it on

    You can use the handy code generator tool to generate the following PHP:

    $re = '/(?:line(\d+)\R)?(\d{2}(?::\d{2}){2}\.\d{2,3})\s*-->\s*(\d{2}(?::\d{2}){2}\.\d{2,3})\R((?:[^\r\n]|\r?\n[^\r\n])*)(?:\r?\n\r?\n|$)/i';
    // Sample data as dumped in the question; blank lines separate the cues.
    $str = 'WEBVTT FILE

    00:00:00.000 --> 00:00:10.000

    00:00:10.000 --> 00:00:20.000
    Other stuff
    Example with 2 lines

    00:00:20.00 --> 00:00:30.000
    Example with only 2 digits in milliseconds

    00:00:30.000 --> 00:00:40.000
    Different stuff

    00:00:40.000 --> 00:00:50.000
    Example without a head line';
    // PREG_SET_ORDER groups the results per match instead of per capture group.
    preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
    // Print the entire match result
    var_dump($matches);

    You can test it on