regex, pcre, lookbehind

Multiline regex to match string after space-separated data


My goal is multiline, Unicode-aware string cleaning using a regex.

I started from this regex, which does not limit where the lookbehind matches:

(?<=[[:blank:]]).*

Then I found a way to limit the lookbehind as follows:

(?!.{20,})(?<=[[:blank:]]).*

It works in some cases, but it is not really stable, because the string length is not predictable.

Also, the trailing comma is undesirable, but I have not figured out how to remove it with a regex, because it appears unpredictably (see the test cases).

How do I create a properly limited lookbehind for this task? I'm using Boost (PCRE-style) regex.

Test cases:

In

РПÑАВÂРßÛÑ ÛÑРВßР ÛÑÛÑАÑÛ, 11.22 Ø.Á.
ÛÑРВЛÛÑВ ÛÑßВßДÛÑВßЛ РИÐРÛПÑÑВÛ 11.22 Ã.Ö
ВßÑÛВÂЛÛÑВ ÛÑВÂÛÑВЛß ßРßÂРÑВЛРÛÐßРВ, 11.22 Â.Ö.
ÛÑВÛÑВ ÛßÛßРÑВßРÐ ßТАÛ, 11.22 Ã.Ö.
РÐÑАВПРßÛÑ ÛÑРВßР ÛÑÛÑАÑÛ, 11.22 Ø.р.
ÛÑРВÂÛÑВ ÛÑßВßДÛÑВß РÂПРÛПÑÑВÛ 11.22 Ø.Á.
ВßÑÛВДЛÛÑВ ÛÑВЛÛÑВЛß ßРßЛРÑВЛРÛЛßРВ 11.22 Ø.Ö.
ÛÑВÛÑВ ÛßÛßРÑВßРÐ ßТАÛ, 11.22 Ï.Á.

Out

РПÑАВÂРßÛÑ ÛÑРВßР ÛÑÛÑАÑÛ
ÛÑРВЛÛÑВ ÛÑßВßДÛÑВßЛ РИÐРÛПÑÑВÛ
ВßÑÛВÂЛÛÑВ ÛÑВÂÛÑВЛß ßРßÂРÑВЛРÛÐßРВ
ÛÑВÛÑВ ÛßÛßРÑВßРÐ ßТАÛ
РÐÑАВПРßÛÑ ÛÑРВßР ÛÑÛÑАÑÛ
ÛÑРВÂÛÑВ ÛÑßВßДÛÑВß РÂПРÛПÑÑВÛ
ВßÑÛВДЛÛÑВ ÛÑВЛÛÑВЛß ßРßЛРÑВЛРÛЛßРВ
ÛÑВÛÑВ ÛßÛßРÑВßРÐ ßТАÛ

Solution

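  • Instead of matching the part you want to keep, you could match the unwanted parts:

    \s*[,\d].*

    Here \s* consumes any whitespace before the unwanted part, [,\d] matches the first comma or digit, and .* matches the rest of the line. Then replace each match with an empty string in your environment.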

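As a rough illustration of the match-and-remove approach, here is a minimal sketch in Python rather than Boost. The function name `strip_trailing_data` and the sample strings are hypothetical and only mimic the shape of the test cases. It assumes the wanted text never contains a comma or a digit.

```python
import re

# Pattern from the answer: optional whitespace, then the first comma or digit,
# then the rest of the line ('.' does not match newlines by default in Python).
UNWANTED = re.compile(r"\s*[,\d].*")


def strip_trailing_data(text: str) -> str:
    """Remove everything from the first comma or digit to the end of each line."""
    return UNWANTED.sub("", text)


# Hypothetical sample input shaped like the test cases (not the original data).
sample = "Альфа Бета Гамма, 11.22 А.Б.\nДельта Эпсилон Зета 11.22 В.Г"
print(strip_trailing_data(sample))
```

Because `\s` also matches line breaks, a line that itself begins with a digit or comma would be removed together with the preceding newline. If that matters for your data, a horizontal-whitespace class such as `[^\S\r\n]*` in place of `\s*` is one possible adjustment.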