regex, python-2.7, regex-group

Match movie filenames with optional parts with regex


I have a film title in the following format

(Studio Name) - Film Title Part-1** - Animation** (2014).mp4

The part between the ** markers (" - Animation") is optional, meaning I can also have a title such as this

(Studio Name) - Film Title Part-1 (2014).mp4

With this regex

^\((?P<studio>.+)\) - (?P<title>.+)(?P<genre>-.+)\((?P<year>\d{4})\)

I get the following results

studio = Studio Name
title  = Film Title Part-1
genre  = - Animation
year   = 2014

I have tried the following to make the "- Animation" optional by changing the regex to

^\((?P<studio>.+)\) - (?P<title>.+)(?:(?P<genre>-.+)?)\((?P<year>\d{4})\)

but I end up with the following results

studio = Studio Name
title  = Film Title Part-1 - Animation
genre  = 
year   = 2014

I am using Python, the code that I am executing to process the regex is

import re

pattern = re.compile(REGEX)     # REGEX holds one of the patterns above
matched = pattern.search(film)  # film is the filename string

Solution

  • You can omit the non-capturing group around the genre, change the first .+ to a negated character class [^()]+ that matches any character except parentheses, and make the .+ in the title group non-greedy so that the optional genre group gets a chance to match.

    For the genre, you could match .+, or make the match more specific if you only want to match a single word.

    ^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)
    


    Explanation

    • ^ Start of string
    • \((?P<studio>[^()]+)\) Named group studio, match 1+ chars other than parentheses between ( and )
    • " - " Match a space, a hyphen and a space literally
    • (?P<title>.+?) Named group title, match any char except a newline, as few times as possible
    • (?P<genre>- \w+ )? Optional named group genre, match a hyphen, a space, 1+ word chars and a space
    • \((?P<year>\d{4})\) Named group year, match 4 digits between ( and )
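As a quick check, the pattern above can be run against both filename variants from the question. This is only a minimal sketch: the names FILM_RE and parse_film are made up for illustration, and the stripping step is an addition of mine, needed because the title group captures a trailing space and the genre group captures the leading "- " and a trailing space. It should run unchanged on Python 2.7 and Python 3.

```python
import re

# Pattern from the answer above; FILM_RE and parse_film are illustrative names.
FILM_RE = re.compile(
    r"^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)"
)


def parse_film(filename):
    """Return the named groups as a dict, or None if the name does not match."""
    matched = FILM_RE.search(filename)
    if matched is None:
        return None
    groups = matched.groupdict()
    # The raw title ends with a space and the raw genre looks like "- Animation ",
    # so tidy both here (an addition, not part of the original pattern).
    groups["title"] = groups["title"].strip()
    if groups["genre"] is not None:
        groups["genre"] = groups["genre"].strip("- ")
    return groups


with_genre = parse_film("(Studio Name) - Film Title Part-1 - Animation (2014).mp4")
without_genre = parse_film("(Studio Name) - Film Title Part-1 (2014).mp4")
print(with_genre)
print(without_genre)
```

When the genre is absent, the genre group is None rather than an empty string, so test it with `is not None` before using it.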

    If you want to match the whole line:

    ^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)\.mp4$
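
To see what the trailing \.mp4$ changes, the anchored pattern can be tried on a few names. This is a rough sketch under assumptions: FULL_RE is an illustrative name, and the .mkv and .mp4.part filenames are invented test inputs that do not appear in the question.

```python
import re

# Anchored variant from above; FULL_RE is an illustrative name.
FULL_RE = re.compile(
    r"^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?"
    r"\((?P<year>\d{4})\)\.mp4$"
)

names = [
    "(Studio Name) - Film Title Part-1 - Animation (2014).mp4",
    "(Studio Name) - Film Title Part-1 (2014).mp4",
    "(Studio Name) - Film Title Part-1 (2014).mkv",       # hypothetical: wrong extension
    "(Studio Name) - Film Title Part-1 (2014).mp4.part",  # hypothetical: trailing text
]

# True where the whole filename matches, False otherwise.
results = [FULL_RE.search(name) is not None for name in names]
print(results)
```

The first two names match and the last two are rejected, whereas the unanchored pattern would accept all four because it stops checking after the closing parenthesis of the year.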