I'm trying to strip unnecessary words such as "Official Video", "Audio", "Music Video", etc. from YouTube video titles. I need help constructing a regex for this. What I have tried so far:
const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;
As I understand it, this removes only the last occurrence of a keyword, so I run it in a loop like this:
function clearSearchTerm(title) {
  const regex = /\s*[-\(\[]?\s*(-|official|video|audio|lyrics|lyric|hd|full|4k|music\s+video|\d{4})\s*[\)\]]?$/gi;
  let newTitle;
  // Keep stripping trailing keywords until the title stops changing.
  do {
    newTitle = title;
    title = title.replace(regex, "");
  } while (newTitle !== title);
  return title;
}
Right now it works for me: I have not found an example where it fails. As was pointed out in the comments, my previous regex also removed keywords that appeared in the middle of a title, which I believe this version fixes. If you have any idea how this can be improved, I'm all ears. Below are examples of what I need to remove.
The words I'm trying to remove are of this kind:
Audio
Video
Lyrics
Official
Remaster
2020 (or years in general)
...
All of those words (and maybe more) can appear between "(" and ")", between "[" and "]", or after "-". They can also be combined; for example, "Some title - Official Video" should be cleaned to "Some title", and so on.
With PCRE (typically in PHP), you can avoid repeating the word list by declaring a sub-pattern and reusing it later in the main pattern. The x flag also lets you add comments and whitespace for readability:
/
(?(DEFINE)
  (?<words_to_drop>
    (?:
      \s*
      \b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b
      \s*
    )+
  )
)
# Ending with a dash followed by words to remove (but not years).
\s+[-–]\s+\g<words_to_drop>$
| # or
# Words or years to remove between brackets or parentheses.
\s*[[(](?:\g<words_to_drop>|\s*\d{4}\s*)+[\])]
/ix
See it in action, with an explanation: https://regex101.com/r/kPeYzb/1
If you have to stick to JavaScript's engine, you have to remove the whitespace and comments and copy-paste the word sub-pattern, which gives the same pattern in JavaScript flavour:
const pattern = /\s+[-–]\s+(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+$|\s*[[(](?:(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+|\s*\d{4}\s*)+[\])]/gi;
In action here: https://regex101.com/r/kPeYzb/2
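As a quick sanity check, here is a minimal sketch that runs the JavaScript pattern over a few of the sample titles. The cleanTitle helper name and the final trim() are my own additions for illustration; only the pattern itself comes from above.

```javascript
// The JavaScript pattern from above, unchanged.
const pattern = /\s+[-–]\s+(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+$|\s*[[(](?:(?:\s*\b(?:Official|Video|Audio|Music|Lyrics?|Remaster(?:ed)?|HD|LP|HQ|4k|Full|Version)\b\s*)+|\s*\d{4}\s*)+[\])]/gi;

// Hypothetical helper (not part of the original pattern): removes every
// match and trims any whitespace left at the ends.
function cleanTitle(title) {
  return title.replace(pattern, '').trim();
}

const samples = [
  'Some title - Official Video',                        // Some title
  'Some title [Official Video]',                        // Some title
  'The Buggles - Video killed the Radio Star',          // unchanged
  'The Smashing Pumpkins - 1979 (Official Music Video)', // The Smashing Pumpkins - 1979
  '1979 (Remastered 2012)',                             // 1979
  'Pulp - Disco 2000',                                  // unchanged
];

for (const title of samples) {
  console.log(cleanTitle(title));
}
```

Note that a keyword in the middle of a title ("Video killed the Radio Star") and a bare year outside brackets ("Disco 2000") are both left alone, because the dash alternative must run to the end of the string and years are only matched inside brackets.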
Now, about your question of avoiding having this list of words entered twice in the regex: it is possible to create the regex from a string with the RegExp() constructor. This means that you could take an array of words (or word regexes) from a configuration:
const input = document.getElementById('input');
const output = document.getElementById('output');
// Original commented regular expression : https://regex101.com/r/kPeYzb/1
// We will build this regular expression from a custom list of words,
// for example taken from a configuration page.
const wordsToRemove = [
'Official',
'Video',
'Audio',
'Music',
'Lyrics?',
'Remaster(?:ed)?',
'HD',
'LP',
'HQ',
'4k',
'Full',
'Version'
];
// IMPORTANT: unlike in a regex literal, when we build a RegExp instance
// from a string, each backslash must be escaped (doubled).
// The regex to match multiple words from this list of words to remove.
const regexWordsToRemove = '(?:\\s*\\b(?:' + wordsToRemove.join('|') + ')\\b\\s*)+';
// The full regex pattern.
const patternCleanup = '\\s+[-–]\\s+' + regexWordsToRemove + '$|\\s*[[(](?:' + regexWordsToRemove + '|\\s*\\d{4}\\s*)+[\\])]';
// Create the regex object.
const regexCleanup = new RegExp(patternCleanup, 'gmi');
// Printing it should give the same result as the original regex we
// made here: https://regex101.com/r/kPeYzb/2
console.log(regexCleanup);
function updateOutput() {
  output.value = input.value.replace(regexCleanup, '');
}

document.addEventListener('DOMContentLoaded', () => {
  // When the input changes, update the output text.
  input.addEventListener('input', updateOutput);
  // Update the output for the initial input value.
  updateOutput();
});
body {
font-family: Arial, sans-serif;
}
.two-cols {
display: grid;
grid-template-columns: 1fr 1fr;
grid-column-gap: .5em;
}
textarea {
/* Just because the snippet space is small. */
font-size: 0.8em;
/* Don't wrap the text, to make comparison easier. */
white-space: pre;
overflow-wrap: normal;
overflow-x: scroll;
box-sizing: border-box;
width: 100%;
}
textarea[readonly] {
color: #666;
background: #f8f8f8;
}
<form id="clean-up" class="two-cols" action="#">
<div>
<label for="input">Input:</label>
<textarea id="input" name="input"
placeholder="Put your text here"
rows="10">Some title - Official Video
Some title [Official Video]
Some title (Official Video)
The Buggles - Video killed the Radio Star
The Smashing Pumpkins - 1979 (Official Music Video)
The Smashing Pumpkins – 1979
1979 (Remastered 2012)
New Order – 1963 (Lyrics)
Paul Davis - '65 Love Affair (1981 LP Version HQ)
Pulp - Disco 2000</textarea>
</div>
<div>
<label for="output">Output: <small>Automatically updated</small></label>
<textarea id="output" name="output"
placeholder="Modified text" readonly
rows="10"></textarea>
</div>
</form>
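If you want to reuse the same idea outside of the page (for example in Node), the construction can be wrapped in a small function. The sketch below is only one way to do it: buildCleanupRegex and defaultWords are names I made up, and it assumes the entries are trusted regex fragments (like 'Lyrics?'), not arbitrary user text, so nothing is escaped. It uses String.raw so that the backslashes do not have to be doubled.

```javascript
// Hypothetical default list, same entries as in the snippet above.
const defaultWords = [
  'Official', 'Video', 'Audio', 'Music', 'Lyrics?', 'Remaster(?:ed)?',
  'HD', 'LP', 'HQ', '4k', 'Full', 'Version',
];

// Hypothetical helper: builds the cleanup regex from a list of regex fragments.
// Assumption: the fragments are trusted, so they are inserted without escaping.
function buildCleanupRegex(words) {
  // String.raw keeps backslashes as typed, so \s and \b need no doubling.
  const wordGroup = String.raw`(?:\s*\b(?:${words.join('|')})\b\s*)+`;
  const source =
    // A dash followed by words to remove, up to the end of the line.
    String.raw`\s+[-–]\s+${wordGroup}$` +
    '|' +
    // Words or years to remove between brackets or parentheses.
    String.raw`\s*[[(](?:${wordGroup}|\s*\d{4}\s*)+[\])]`;
  // 'm' makes $ match at the end of each line, as in the snippet above.
  return new RegExp(source, 'gmi');
}

const regexDefault = buildCleanupRegex(defaultWords);
console.log('Some title - Official Video\nPulp - Disco 2000'.replace(regexDefault, ''));

// Extending the list only requires adding an entry.
const regexWithLive = buildCleanupRegex([...defaultWords, 'Live']);
console.log('Song (Live 2019)'.replace(regexWithLive, '')); // Song
```

If the list ever comes from free-form user input, each entry would need to be escaped before being joined, otherwise characters such as "(" or "+" would change the meaning of the pattern.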