First, let me describe what I am trying to achieve. I want to copy and paste a list of football games (as a regular user, not as a developer, so the plain text from a website rather than the HTML from the inspector), and therefore I have to parse the text.
The pasted text would look like:
PERU\r\nLiga 2\r\nClasament Live\r\nFinal\r\nSanta Rosa\r\n\r\n0\r\n - \r\n3\r\n\r\nMolinos El Pirata\r\n(0 - 1)\r\n73 \r\nChavelines\r\n\r\n1\r\n - \r\n0\r\n\r\nDeportivo Coopsol\r\n(1 - 0)\r\n20:30\r\nComerciantes Unidos\r\n\r\n-\r\n\r\nJuan Aurich\r\n22:45\r\nSantos FC\r\n\r\n-\r\n\r\nHuaral\r\nPOLONIA\r\nEkstraklasa\r\nClasament Live\r\n90+1 \r\nPogon Szczecin\r\n\r\n2\r\n - \r\n0\r\n\r\nStal Mielec\r\n(1 - 0)\r\nlive\r\n20:30\r\nPlock\r\n\r\n-\r\n\r\nGornik Z.\r\nPORTUGALIA\r\nPrimeira Liga\r\nClasament\r\n21:15\r\nFarense\r\n\r\n-\r\n\r\nMaritimo
And what I need is then to build something like this:
Final Santa Rosa 0 - 3 Molinos El Pirata
75 Chavelines 1 - 0 Deportivo Coopsol
20:30 Comerciantes Unidos - Juan Aurich
22:45 Santos FC - Huaral
90+3 Pogon Szczecin 2 - 0 Stal Mielec
20:30 Plock - Gornik Z.
21:15 Farense - Maritimo
So the plan is to extract each individual line into an array and then put them in a table. I am first cleaning up the text I don't need (the country names, the league names, the half-time scores):
gamesUnformatted = gamesUnformatted.replace(/\b[A-Z]{5,}\b/g, '['); // replace the country name (names with more than 4 letters, to avoid removing LASK, TSKA... but it will remove IRAN, ASIA - find better way) which is in capital letters with [
gamesUnformatted = gamesUnformatted.replace(/Clasament Live/g, ']');
gamesUnformatted = gamesUnformatted.replace(/Clasament/g, ']'); // replace the words Clasament with ]
gamesUnformatted = gamesUnformatted.replace(/ *\[[^\]]*]/g, ''); // remove everything between [ and ], including the square brackets
gamesUnformatted = gamesUnformatted.replace(/\(\d{1,2} - \d{1,2}\)/g, ''); // remove half time score eg (0 - 0)
Now I want to add the word newLine in front of every line, so that later I can split by "newLine" and get each game as a separate array element. A line can start in three ways: with the kick-off time if the game hasn't started (20:30), with Final if the game has ended, or with the minute if the game is running (e.g. 70). For the first two I have the following:
gamesUnformatted = gamesUnformatted.replace(/\d{2}:\d{2}/g, 'newLine$&'); // add the word newLine in front of the starting hours
gamesUnformatted = gamesUnformatted.replace(/Final/g, 'newLine$&'); // add the word newLine in front of the word Final (game has ended)
But the third one is trickier. The minute can be 0-90, then with extra time 90+something (e.g. 90+3), and then there can be two extra halves (e.g. 120, 120+..). This is where I need some help: I need a regex that matches all these scenarios but excludes the others. To be more precise, I need to match the minute (1-120 and 1-120+...) but not the score or the hour (1-0, 20:30). I have tried all sorts of things for half a day, too many to list here, including ^, ?: and !. I am not good with regex, so most of what I tried was probably silly, but this is what I have at the moment:
gamesUnformatted = gamesUnformatted.replace(/\d{1,3}[^(\d{2}:\d{2})]/g, 'newLine$&');
This would be just the first step: match any number with 1 to 3 digits, not yet considering the "90+4" case, while trying to ignore the hours (but not yet the scores). It is not working well, because it adds newLine in front of every number. So this:
90+3 Pogon Szczecin 2 - 0 Stal Mielec
20:30 Plock - Gornik Z.
becomes this:
newLine90+newLine3 Pogon Szczecin newLine2 - newLine0 Stal Mielec
newLinenewLine20:newLine30 Plock - Gornik Z.
instead of this (the second row has two newLine markers because one was already added in front of the hour earlier, so that one must be ignored):
newLine90+3 Pogon Szczecin 2 - 0 Stal Mielec
newLinenewLine20:30 Plock - Gornik Z.
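The unwanted result for the first row is consistent with how the attempted pattern is parsed: [^(\d{2}:\d{2})] is a single negated character class rather than a negative lookahead, so it simply consumes one character that is not a digit, a parenthesis, a curly brace or a colon. A minimal sketch that reproduces it (the function name markWithAttempt is only illustrative and not part of the original code):

```javascript
// The pattern from the question: 1 to 3 digits followed by ONE character
// that is not a digit, "(", ")", "{", "}" or ":".
const attempt = /\d{1,3}[^(\d{2}:\d{2})]/g;

// Hypothetical helper that applies the attempted replacement.
function markWithAttempt(text) {
  return text.replace(attempt, 'newLine$&');
}

console.log(markWithAttempt('90+3 Pogon Szczecin 2 - 0 Stal Mielec'));
// newLine90+newLine3 Pogon Szczecin newLine2 - newLine0 Stal Mielec
```

Every number followed by "+" or a space satisfies the pattern, which is why the minute, the added time and both score digits each get a marker.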
Before adding the new line, you can do one more replacement to ensure each score is on one line, like 0-1.
Demo: https://regex101.com/r/Dur5lD/4
Pattern: (\d{1,2})\s*-\s*(\d{1,2})
Replacement: $1-$2
Explanation: Since the text contains newlines, I have used \s to match whitespace sequences, and the capturing groups $1 and $2 in the replacement to get the desired output.
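In JavaScript this step could look roughly like the sketch below. The function name joinScores and the sample string are assumptions made for illustration; only the pattern and the replacement come from the step above.

```javascript
// Collapse a score that is spread over several lines ("0\r\n - \r\n3") into "0-3".
function joinScores(text) {
  return text.replace(/(\d{1,2})\s*-\s*(\d{1,2})/g, '$1-$2');
}

// A fragment of the pasted text, used here only as sample input.
const pastedFragment = 'Final\r\nSanta Rosa\r\n\r\n0\r\n - \r\n3\r\n\r\nMolinos El Pirata';

console.log(JSON.stringify(joinScores(pastedFragment)));
// "Final\r\nSanta Rosa\r\n\r\n0-3\r\n\r\nMolinos El Pirata"
```

A lone "-" for a game that has not started is left untouched, because the pattern requires digits on both sides.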
Once we have done this, adding newLine should be straightforward.
Demo: https://regex101.com/r/Dur5lD/5
Pattern: ^((?:Final)|(?:\d{2}:\d{2})|(?:\d{1,3}(?!\d)(?!-)))
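Explanation:
The pattern matches Final, or an hour, or a time (the minute) at the start of a line.
(?!...) is a negative lookahead. It implies the time value like 70 or 120 should not be followed by - or another digit.
Note:
This assumes \r\n as the newline characters. If not, we might need to replace the \s and ^ characters in the expression with literal \r\n.
The country-name cleanup from the question does not remove PERU (its pattern requires five or more capital letters), so I manually removed the line.
Replacing \n with \t and then replacing newLine with \n yielded https://regex101.com/r/Dur5lD/6.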
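Putting the two replacements together in JavaScript might look like the following sketch. It assumes the country, league and half-time lines have already been removed as in the question; the names LINE_START, toGameLines and cleanedSample are hypothetical. The m flag makes ^ match at the start of every line, and the g flag applies the replacement to all of them.

```javascript
// Marker pattern from above: Final, hh:mm, or 1-3 digits not followed by a digit or "-".
const LINE_START = /^((?:Final)|(?:\d{2}:\d{2})|(?:\d{1,3}(?!\d)(?!-)))/gm;

// Hypothetical helper: turns already-cleaned pasted text into one string per game.
function toGameLines(cleaned) {
  return cleaned
    .replace(/(\d{1,2})\s*-\s*(\d{1,2})/g, '$1-$2') // "0\r\n - \r\n3" becomes "0-3"
    .replace(LINE_START, 'newLine$&')               // mark where each game starts
    .split('newLine')                               // one chunk per game
    .map(chunk => chunk.trim().replace(/\s+/g, ' ')) // squeeze line breaks into spaces
    .filter(line => line.length > 0);               // drop the empty leading chunk
}

// Sample input: a few games from the question with the unwanted lines removed.
const cleanedSample =
  'Final\r\nSanta Rosa\r\n\r\n0\r\n - \r\n3\r\n\r\nMolinos El Pirata\r\n\r\n' +
  '73 \r\nChavelines\r\n\r\n1\r\n - \r\n0\r\n\r\nDeportivo Coopsol\r\n\r\n' +
  '20:30\r\nComerciantes Unidos\r\n\r\n-\r\n\r\nJuan Aurich\r\n' +
  '90+1 \r\nPogon Szczecin\r\n\r\n2\r\n - \r\n0\r\n\r\nStal Mielec\r\n';

console.log(toGameLines(cleanedSample));
// [ 'Final Santa Rosa 0-3 Molinos El Pirata',
//   '73 Chavelines 1-0 Deportivo Coopsol',
//   '20:30 Comerciantes Unidos - Juan Aurich',
//   '90+1 Pogon Szczecin 2-0 Stal Mielec' ]
```

One limitation of this sketch: a team name that sits on its own line and begins with one to three digits followed by something other than a digit or "-" would also be treated as a minute, so such names would need separate handling.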