
R regular expression to extract TV show name from text file


I'm trying to extract TV show names from a text file using R.

I have loaded the file and assigned its contents to a variable called txt. Now I'm trying to use a regular expression to extract just the information I want.

The lines I want to extract look like this:

SHOW: Game of Thrones 7:00 PM EST
SHOW: The Outsider 3:00 PM EST
SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST

and so on. There are about 320 shows and I want to extract all 320 of them.

So far, I've come up with this.

pattern <- "SHOW:\\s\\w*"
str_extract_all(txt, pattern)

However, it doesn't extract the entire title as I intended (e.g., it extracts "SHOW: Game" instead of "SHOW: Game of Thrones"). To extract that one show, I could use "SHOW:\\s\\w*\\s\\w*\\s\\w*" to match the word count, but I want to extract every show in txt, whether its name is longer or shorter.

How should I write the regular expression to get the intended result?


Solution

  • You can get the value without lookarounds by matching SHOW: and capturing the title in group 1, matching as few characters as possible up to the time that precedes AM or PM.

    \bSHOW:\s+(.*?)\s+\d{1,2}:\d{1,2}\s+[AP]M\b
    

    Explanation

    • \bSHOW:\s+ A word boundary, then SHOW: and 1+ whitespace chars
    • (.*?) Capture group 1, matching as few characters as possible (non-greedy)
    • \s+\d{1,2}:\d{1,2} Match 1+ whitespace chars, 1-2 digits : 1-2 digits
    • \s+[AP]M\b Match 1+ whitespace chars followed by either AM or PM and a word boundary
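
To see why the \w* attempt from the question stops after the first word, and what the lazy quantifier changes, here is a minimal sketch in base R. It uses regexec/regmatches with perl = TRUE instead of stringr purely so that it runs without extra packages. The helper name group1, the greedy variant, and the two-show line in `two` are made up for illustration and do not come from the question.

```r
# Pattern from the answer, plus a greedy variant for comparison.
lazy   <- "\\bSHOW:\\s+(.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"
greedy <- "\\bSHOW:\\s+(.*)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"

# Hypothetical helper: return capture group 1 of the first match in x.
group1 <- function(pattern, x) {
  regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][2]
}

one <- "SHOW: Game of Thrones 7:00 PM EST"
# Made-up line holding two shows, to make the greedy/lazy difference visible.
two <- "SHOW: A 7:00 PM EST SHOW: B 3:00 PM EST"

# The question's pattern: \w* cannot cross the space after the first word.
first_try <- regmatches(one, regexpr("SHOW:\\s\\w*", one, perl = TRUE))

print(first_try)            # "SHOW: Game"
print(group1(lazy, one))    # "Game of Thrones"
print(group1(lazy, two))    # "A"
print(group1(greedy, two))  # "A 7:00 PM EST SHOW: B"
```

When each line holds exactly one time, the greedy and lazy versions return the same title. The lazy form only matters when more than one time can appear after a single SHOW:.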

    R demo

    library(stringr)
    
    txt <- c("SHOW: Game of Thrones 7:00 PM EST", "SHOW: The Outsider 3:00 PM EST", "SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST")
    pattern <- "\\bSHOW:\\s+(.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"
    str_match(txt, pattern)[,2]
    

    Output

    [1] "Game of Thrones"                                                         
    [2] "The Outsider"                                                            
    [3] "Don't Be a Menace to South Central While Drinking Your Juice In The Hood"
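
The demo above assumes txt is a character vector with one show per element. If the file was instead read into a single string, str_match would return only the first show. The question does not say how the file was loaded, so the following is a sketch under the assumption that the whole file is one string with newline-separated lines. It uses str_match_all, which returns one matrix of matches per input string.

```r
library(stringr)

# Assumption: the whole file sits in one string, with one show per line.
txt_one <- paste(
  "SHOW: Game of Thrones 7:00 PM EST",
  "SHOW: The Outsider 3:00 PM EST",
  sep = "\n"
)

pattern <- "\\bSHOW:\\s+(.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"

# str_match_all() returns a list with one matrix per input string;
# column 1 is the full match and column 2 is capture group 1.
shows <- str_match_all(txt_one, pattern)[[1]][, 2]
print(shows)
```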
    

    If you want to include SHOW: in the result, make it part of the capturing group.

    \b(SHOW:.*?)\s+\d{1,2}:\d{1,2}\s+[AP]M\b
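
In R, this variant can be used the same way as the first pattern. The sketch below reuses two of the sample lines from the question. The variable names pattern_with_label and with_label are illustrative. Only the position of the capturing group changes, so column 2 of the str_match result now carries the SHOW: prefix.

```r
library(stringr)

# Sample lines taken from the question.
txt <- c("SHOW: Game of Thrones 7:00 PM EST",
         "SHOW: The Outsider 3:00 PM EST")

# SHOW: is now inside capture group 1, so it is kept in the result.
pattern_with_label <- "\\b(SHOW:.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"

with_label <- str_match(txt, pattern_with_label)[, 2]
print(with_label)
```

str_match returns NA for elements that do not match, so comparing sum(!is.na(with_label)) with the expected number of shows is a quick way to check that no line was missed.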
    
