
JavaScript regex to match an expression except another expression


First, let me describe what I am trying to achieve. I want to copy and paste a list of football games the way a regular user would, not as a developer: the plain text from a website, not the HTML from inspecting the page. Therefore I have to parse the text. On the website it looks like this:

[Image: screenshot of the games list on the website]

The pasted text would look like:

PERU\r\nLiga 2\r\nClasament Live\r\nFinal\r\nSanta Rosa\r\n\r\n0\r\n - \r\n3\r\n\r\nMolinos El Pirata\r\n(0 - 1)\r\n73 \r\nChavelines\r\n\r\n1\r\n - \r\n0\r\n\r\nDeportivo Coopsol\r\n(1 - 0)\r\n20:30\r\nComerciantes Unidos\r\n\r\n-\r\n\r\nJuan Aurich\r\n22:45\r\nSantos FC\r\n\r\n-\r\n\r\nHuaral\r\nPOLONIA\r\nEkstraklasa\r\nClasament Live\r\n90+1 \r\nPogon Szczecin\r\n\r\n2\r\n - \r\n0\r\n\r\nStal Mielec\r\n(1 - 0)\r\nlive\r\n20:30\r\nPlock\r\n\r\n-\r\n\r\nGornik Z.\r\nPORTUGALIA\r\nPrimeira Liga\r\nClasament\r\n21:15\r\nFarense\r\n\r\n-\r\n\r\nMaritimo

And what I need is then to build something like this:

Final     Santa Rosa            0 - 3  Molinos El Pirata
75        Chavelines            1 - 0  Deportivo Coopsol
20:30     Comerciantes Unidos     -    Juan Aurich
22:45     Santos FC               -    Huaral
90+3      Pogon Szczecin        2 - 0  Stal Mielec
20:30     Plock                   -    Gornik Z.
21:15     Farense                 -    Maritimo

So the plan is to extract each individual line into an array and then put them in a table. First I clean up the text I don't need (the country names, the league names, the half-time scores):

gamesUnformatted = gamesUnformatted.replace(/\b[A-Z]{5,}\b/g, '['); // replace the country name (names with more than 4 letters, to avoid removing LASK, TSKA... but it will remove IRAN, ASIA - find better way) which is in capital letters with [
gamesUnformatted = gamesUnformatted.replace(/Clasament Live/g, ']');
gamesUnformatted = gamesUnformatted.replace(/Clasament/g, ']'); // replace the words Clasament with ]
gamesUnformatted = gamesUnformatted.replace(/ *\[[^\]]*]/g, ''); // remove everything between [ and ], including the square brackets
gamesUnformatted = gamesUnformatted.replace(/\(\d{1,2} - \d{1,2}\)/g, ''); // remove half time score eg (0 - 0)

Now I want to add the word newLine in front of every line, so that later I can just split by "newLine" and have all the independent lines in the array. There are three scenarios for where a line starts: the game has not started (e.g. 20:30), the game has ended (Final), or the game is in progress (e.g. 70). For the first two I have the following:

gamesUnformatted = gamesUnformatted.replace(/\d{2}:\d{2}/g, 'newLine$&'); // add the word newLine in front of the starting hours
gamesUnformatted = gamesUnformatted.replace(/Final/g, 'newLine$&'); // add the word newLine in front of the word Final (game has ended)

But the third one is trickier. The minute can be 0-90, then 90+something with added time (e.g. 90+3), and then there can be two extra halves (e.g. 120, 120+...). This is where I need some help. I need a regex that matches all these scenarios but excludes the others. To be more precise, I need to match the minute (1-120 and 1-120+...) but not the score or the hour (1-0, 20:30). I have tried all sorts of things for half a day and can't list them all here, but they involved ^, ?:, ! and what not. I must say I am not good with regex, so most of the things I tried were probably silly. What I have at this moment is this:

gamesUnformatted = gamesUnformatted.replace(/\d{1,3}[^(\d{2}:\d{2})]/g, 'newLine$&');

This would be just the first step: match any number with 1 to 3 digits, not yet considering "90+4", while trying to ignore the hours (the scores are not handled yet). But this is not working well, because it adds newLine in front of every number. So this:

90+3      Pogon Szczecin        2 - 0 Stal Mielec
20:30     Plock                   - Gornik Z.

becomes this:

newLine90+newLine3      Pogon Szczecin        newLine2 - newLine0 Stal Mielec
newLinenewLine20:newLine30     Plock                   - Gornik Z.

instead of this (on the second row there are two newLine because one was added before to the hours, so that must be ignored):

newLine90+3      Pogon Szczecin        2 - 0 Stal Mielec
newLinenewLine20:30     Plock                   - Gornik Z.

Solution

  • Before adding the newLine marker, you can do one more replacement so that each score ends up on a single line, like 0-1.

    Demo: https://regex101.com/r/Dur5lD/4

    Pattern: Match: (\d{1,2})\s*-\s*(\d{1,2}); Replacement: $1-$2

    Explanation: Since the text contains line breaks, I used \s* to match any run of whitespace, line breaks included. The capturing groups, referenced as $1 and $2 in the replacement, keep the two numbers of the score.
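A minimal sketch of this score-joining step in JavaScript. The sample string is a hypothetical fragment of the pasted text and assumes real CR/LF line breaks; the variable names are made up for illustration.

```javascript
// Hypothetical fragment of the pasted text, with real CR/LF line breaks.
const sample =
  'Santa Rosa\r\n\r\n0\r\n - \r\n3\r\n\r\nMolinos El Pirata\r\n' +
  '20:30\r\nPlock\r\n\r\n-\r\n\r\nGornik Z.';

// \s* also matches \r and \n, so a score spread over three lines
// ("0", " - ", "3") collapses into "0-3".
const joinedScores = sample.replace(/(\d{1,2})\s*-\s*(\d{1,2})/g, '$1-$2');

// JSON.stringify makes the remaining line breaks visible.
console.log(JSON.stringify(joinedScores));
```

The lone "-" of a game that has not started is left untouched, because the pattern requires a digit on both sides of the dash.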


    Once this is done, adding newLine should be straightforward.

    Demo: https://regex101.com/r/Dur5lD/5

    Pattern: ^((?:Final)|(?:\d{2}:\d{2})|(?:\d{1,3}(?!\d)(?!-)))

    Explanation:

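    • Capture a group that matches one of: Final, a kick-off time (e.g. 20:30), or a match minute (e.g. 70).
    • For the minute, use negative lookaheads, (?!...). They require that a value like 70 or 120 is not followed by another digit or by -, which rules out scores such as 1-0.
    • The ^ anchor restricts matches to the start of a line, so the expression needs the multiline flag (m) to match at every line start rather than only at the start of the text.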
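The pattern above could be applied in JavaScript roughly as follows. This is a sketch under assumptions: the input is a hypothetical sample in which the scores are already joined and the half-time scores already removed, and the replacement string 'newLine$1' is my assumption of how the marker is prepended.

```javascript
// Hypothetical input: scores already joined ("2-0"), half-time scores removed.
const text = [
  'Final', 'Santa Rosa', '', '0-3', '', 'Molinos El Pirata',
  '90+1 ', 'Pogon Szczecin', '', '2-0', '', 'Stal Mielec',
  '20:30', 'Plock', '', '-', '', 'Gornik Z.',
].join('\r\n');

// "m" lets ^ match at the start of every line; "g" replaces all matches.
// The two lookaheads reject a number followed by another digit or by "-".
const lineStart = /^((?:Final)|(?:\d{2}:\d{2})|(?:\d{1,3}(?!\d)(?!-)))/gm;

// $1 is the captured status: Final, hh:mm, or the minute.
const marked = text.replace(lineStart, 'newLine$1');

console.log(marked);
```

One caveat of this sketch: a team name that starts with a short number (for example "1. FC ...") would also get a marker, so such names would need extra handling.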

    Note:

    • I assumed \r\n are actual line-break characters. If they are literal text instead, we might need to replace \s and ^ in the expression with a literal \r\n.
    • Your cleanup regex does not handle PERU (the name has only four letters, while \b[A-Z]{5,}\b requires at least five), so I manually removed that line.
    • Replacing \n with \t and then replacing newLine with \n yields https://regex101.com/r/Dur5lD/6.
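Putting the steps together, here is a hedged end-to-end sketch in JavaScript. The function name parseGames and the shortened sample are hypothetical, the country and league lines are assumed to be removed already, and instead of swapping \n for \t it splits each chunk on line breaks, which is a variation on the last note rather than the exact procedure from the demo.

```javascript
// Hypothetical shortened sample of the pasted text (real CR/LF line breaks),
// with country and league lines already removed.
const pasted =
  'Final\r\nSanta Rosa\r\n\r\n0\r\n - \r\n3\r\n\r\nMolinos El Pirata\r\n(0 - 1)\r\n' +
  '90+1 \r\nPogon Szczecin\r\n\r\n2\r\n - \r\n0\r\n\r\nStal Mielec\r\n(1 - 0)\r\n' +
  '20:30\r\nPlock\r\n\r\n-\r\n\r\nGornik Z.';

// parseGames is a made-up helper name for this sketch.
function parseGames(raw) {
  const marked = raw
    // Drop half-time scores such as "(0 - 1)" before the scores are joined.
    .replace(/\(\d{1,2} - \d{1,2}\)/g, '')
    // Join each score onto a single line: "0", " - ", "3" becomes "0-3".
    .replace(/(\d{1,2})\s*-\s*(\d{1,2})/g, '$1-$2')
    // Mark every line that starts with Final, hh:mm, or a minute.
    .replace(/^((?:Final)|(?:\d{2}:\d{2})|(?:\d{1,3}(?!\d)(?!-)))/gm, 'newLine$1');

  return marked
    .split('newLine')
    .filter((chunk) => chunk.trim() !== '')
    // One game per chunk: keep its non-empty, trimmed lines as fields.
    .map((chunk) => chunk.split(/\r?\n/).map((s) => s.trim()).filter((s) => s !== ''));
}

const rows = parseGames(pasted);

// Each row is [status, home team, score, away team]; the column widths are arbitrary.
for (const [status, home, score, away] of rows) {
  console.log(status.padEnd(10) + home.padEnd(22) + score.padEnd(7) + away);
}
```

The marker string newLine is arbitrary; any token that cannot occur in a team name would do.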