Regex to remove substrings such as "Official Video", "Audio", "Music Video"... from string


I'm trying to clean YouTube video titles of unnecessary words such as "Official Video", "Audio", "Music Video", etc. I need help constructing a regex that I can use. What I have tried so far:

const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;

As I understand it, this removes only the last occurrence of a keyword, so I run it in a loop until the title stops changing:

function clearSearchTerm(title) {
    const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;
    let newTitle;

    do {
        newTitle = title;
        title = title.replace(regex, "");
    } while (newTitle !== title);

    return title;
}

Right now it works for me, since I haven't found an example where it fails. As was mentioned in the comments, my previous regex also removed keywords that appeared in the middle of a title, which I think this version solves. If you have any ideas on how this can be improved, I'm all ears. Below are examples of what I need to remove.

The words I'm trying to remove are of this kind:

Audio
Video
Lyrics
Official
Remaster
2020 (or years in general)
...

All of these words (and maybe more) can appear between ( and ), between [ and ], or after -. They can also be combined. For example, Some title - Official Video should be cleaned to Some title.
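To make the behaviour of the loop concrete, here is a small check of the function above against a few titles. The function is copied unchanged from the question; the sample titles are borrowed from the demo input in the solution below, and the `samples` array name is only illustrative.

```javascript
// The question's function, unchanged.
function clearSearchTerm(title) {
    const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;
    let newTitle;

    // Keep stripping the trailing keyword until nothing changes.
    do {
        newTitle = title;
        title = title.replace(regex, "");
    } while (newTitle !== title);

    return title;
}

// Illustrative sample titles (taken from the solution's demo input).
const samples = [
    "Some title - Official Video",
    "Some title [Official Video]",
    "The Smashing Pumpkins - 1979 (Official Music Video)",
    "Pulp - Disco 2000",
];

for (const title of samples) {
    console.log(JSON.stringify(title), "->", JSON.stringify(clearSearchTerm(title)));
}
```

The first two titles come out as "Some title", as wanted. The last two show a limitation: because a bare year at the end of the string also matches, "1979" and "2000" are stripped even though they are part of the song title, leaving "The Smashing Pumpkins" and "Pulp - Disco".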


Solution

  • With PCRE (typically in PHP), you can avoid repeating the list of words by declaring a sub-pattern and reusing it later in the main pattern. The x flag also lets you add comments and whitespace for readability:

    /
    (?(DEFINE)
      (?<words_to_drop>
        (?:
          \s*
          \b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b
          \s*
        )+
      )
    )
    # Ending with - followed by words to remove (but not years).
    \s+[-–]\s+\g<words_to_drop>$
    | # or
    # Words or years to remove between brackets or parentheses.
    \s*[[(](?:\g<words_to_drop>|\s*\d{4}\s*)+[\])]
    /ix
    

    See it in action with the explanation: https://regex101.com/r/kPeYzb/1

    If you have to stick to JavaScript's engine, you'll have to remove the spaces and comments and copy-paste the pattern for the words. This gives the same pattern in JavaScript flavour:

    const pattern = /\s+[-–]\s+(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+$|\s*[[(](?:(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+|\s*\d{4}\s*)+[\])]/gi;
    

    In action here: https://regex101.com/r/kPeYzb/2
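As a quick sanity check, the pattern above can be applied to single titles with `String.prototype.replace`. This is a minimal sketch: the pattern is copied verbatim from the answer, while the `cleanTitle` helper name and the `titles` array are assumptions added for illustration.

```javascript
// The JavaScript pattern from the answer, copied verbatim.
const pattern = /\s+[-–]\s+(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+$|\s*[[(](?:(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+|\s*\d{4}\s*)+[\])]/gi;

// Hypothetical helper: cleans one title (one title per string, so no m flag).
function cleanTitle(title) {
  return title.replace(pattern, '');
}

// Illustrative inputs taken from the demo data further down.
const titles = [
  "Some title - Official Video",
  "Some title (Official Video)",
  "The Buggles - Video killed the Radio Star",
  "The Smashing Pumpkins - 1979 (Official Music Video)",
  "Paul Davis - '65 Love Affair (1981 LP Version HQ)",
];

for (const title of titles) {
  console.log(cleanTitle(title));
}
```

The Buggles title is left untouched because "Video" is not followed by the end of the string, and "1979" survives because a bare year after a dash is not matched by the first alternative; years are only removed inside brackets or parentheses.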

    Now, about avoiding having this list of words entered twice in the regex: it is possible to create the regex from a string with the RegExp() constructor. This means that you could keep an array of words (or word regexes) loaded from a configuration:

    const input = document.getElementById('input');
    const output = document.getElementById('output');
    
    // Original commented regular expression : https://regex101.com/r/kPeYzb/1
    
    // We will build this regular expression from a custom list of words,
    // for example taken from a configuration page.
    const wordsToRemove = [
      'Official',
      'Video',
      'Audio',
      'Music',
      'Lyrics?',
      'Remaster(?:ed)?',
      'HD',
      'LP',
      'HQ',
      '4k',
      'Full',
      'Version'
    ];
    // IMPORTANT: compared to the regex syntax, if we build a RegExp instance
    //            from a string, each backslash should be escaped.
    // The regex to match multiple words from this list of words to remove.
    const regexWordsToRemove = '(?:\\s*\\b(?:' + wordsToRemove.join('|') + ')\\b\\s*)+';
    // The full regex pattern.
    const patternCleanup = '\\s+[-–]\\s+' + regexWordsToRemove + '$|\\s*[[(](?:' + regexWordsToRemove + '|\\s*\\d{4}\\s*)+[\\])]';
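    // Create the regex object. The m flag makes $ match at the end of each
    // line, because the textarea holds one title per line.
    const regexCleanup = new RegExp(patternCleanup, 'gmi');
    // Printing it should give the same pattern as the original regex we
    // made here: https://regex101.com/r/kPeYzb/2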
    console.log(regexCleanup);
    
    function updateOutput() {
      output.value = input.value.replace(regexCleanup, '');
    }
    
    document.addEventListener('DOMContentLoaded', (loaded) => {
      // When the input changes, update the output text.
      input.addEventListener('input', updateOutput);
      
      // Update the output for the initial input value.
      updateOutput();
    });

    /* CSS for the demo page. */
    body {
      font-family: Arial, sans-serif;
    }
    
    .two-cols {
      display: grid;
      grid-template-columns: 1fr 1fr;
      grid-column-gap: .5em;
    }
    
    textarea {
      /* Just because the snippet space is small. */
      font-size: 0.8em;
      /* Don't wrap the text, to make comparison easier. */
      white-space: pre;
      overflow-wrap: normal;
      overflow-x: scroll;
      box-sizing: border-box;
      width: 100%;
    }
    
    textarea[readonly] {
      color: #666;
      background: #f8f8f8;
    }

    <!-- HTML for the demo page. -->
    <form id="clean-up" class="two-cols" action="#">
    
      <div>
        <label for="input">Input:</label>
        <textarea id="input" name="input"
                  placeholder="Put your text here"
                  rows="10">Some title - Official Video
    Some title [Official Video]
    Some title (Official Video)
    The Buggles - Video killed the Radio Star
    The Smashing Pumpkins - 1979 (Official Music Video)
    The Smashing Pumpkins – 1979
    1979 (Remastered 2012)
    New Order – 1963 (Lyrics)
    Paul Davis - '65 Love Affair (1981 LP Version HQ)
    Pulp - Disco 2000</textarea>
      </div>
      
      <div>
        <label for="output">Output: <small>Automatically updated</small></label>
        <textarea id="output" name="output"
                  placeholder="Modified text" readonly
                  rows="10"></textarea>
      </div>
      
    </form>
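The same idea can also be sketched without the DOM, for example to run under Node. This variant is an assumption, not part of the original answer: it uses `String.raw` so the backslashes do not need doubling, and the names `buildCleanupRegex` and `cleanTitle` are hypothetical. Entries in the list are still treated as regex fragments, as in the snippet above.

```javascript
// Hypothetical helper: builds the cleanup regex from a list of regex fragments.
function buildCleanupRegex(fragments) {
  // One or more words from the list, each optionally surrounded by spaces.
  const words = String.raw`(?:\s*\b(?:${fragments.join('|')})\b\s*)+`;
  // Either " - words" at the end, or words/years inside brackets or parentheses.
  const source = String.raw`\s+[-–]\s+${words}$|\s*[[(](?:${words}|\s*\d{4}\s*)+[\])]`;
  // No m flag here: each title is cleaned as a separate string.
  return new RegExp(source, 'gi');
}

const defaultFragments = [
  'Official', 'Video', 'Audio', 'Music', 'Lyrics?', 'Remaster(?:ed)?',
  'HD', 'LP', 'HQ', '4k', 'Full', 'Version',
];

// Hypothetical helper: removes the unwanted parts from a single title.
function cleanTitle(title, regex) {
  return title.replace(regex, '');
}

const regexDefault = buildCleanupRegex(defaultFragments);
console.log(cleanTitle('New Order – 1963 (Lyrics)', regexDefault));
console.log(cleanTitle('1979 (Remastered 2012)', regexDefault));

// Extending the list, e.g. from a configuration, only needs a new array.
const regexWithLive = buildCleanupRegex([...defaultFragments, 'Live']);
console.log(cleanTitle('Song (Live 1999)', regexWithLive));
```

Because the fragments are inserted into the pattern as-is, a list coming from user input would need its special characters escaped first if the entries are meant to be literal words.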