regex, regex-lookarounds

Regex: find up to the first value before a match


I have a string value and, because of how the string is populated (out of my control), I have \n new line instances right in the middle of a company name.

I want to do a regex replace on those particular matches to replace the \n with a space.

This is a snippet of my output (it can change). For every row, I'm trying to match from the closest \n before a date, rather than from an earlier one, and extract the text in between.

\nGBP*\nAA1234567 A random company name - I 03-Mar-2023 BUY 42.6400 42.6900 GBP 1,820.3016 1.0000 1,842.4400\nAA1234568 Another randon company name - H-M 03-Mar-2023 BUY 11.9880 845.6000 GBP 10,137.0528 1.0000 10,159.1700\nAA12345679 Third Party Utilies - Fund - Class\nAA-B Income\n03-Mar-2023 BUY 6.4120 836.9100 GBP 5,366.2669 1.0000 5,388.5200\nAA12345670 Company 4 - M 03-Mar-2023 BUY 205.6830 10.8500 GBP 2,231.6606 1.0000 2,253.7800\nAA2345678 Another random page up company - I 03-Mar-2023 BUY 66.3850 45.4400 GBP 3,016.5344 1.0000 3,038.6500\nASSET SCHEDULE\nPolicy Number 1234-56789\nAA2345679 Company 5 Utilities- M 03-Mar-2023 BUY 76.7370 13.7400 GBP 1,054.3664 1.0000 1,076.4900\nTotal

It's currently returning:

GBP*\nAA1234567 A random company name - I 03-Mar-2023
AA1234568 Another random company name - H-M 03-Mar-2023
AA12345679 Third Party Utilities - Fund - Class\nAA-B Income\n03-Mar-2023
AA12345670 Company 4 - M 03-Mar-2023
AA2345678 Another random page up company - I 03-Mar-2023
ASSET SCHEDULE\nPolicy Number     1234-56789\nAA2345679 Company 5 Utilities- M 03-Mar-2023

But what I want to retrieve is the following.

AA1234567 A random company name - I 03-Mar-2023 BUY 42.6400 42.6900 GBP 1,820.3016 1.0000 1,842.4400
AA1234568 Another random company name - H-M 03-Mar-2023 BUY 11.9880 845.6000 GBP 10,137.0528 1.0000 10,159.1700
AA12345679 Third Party Utilities - Fund - Class\nAA-B Income\n03-Mar-2023 BUY 6.4120 836.9100 GBP 5,366.2669 1.0000 5,388.5200
AA12345670 Company 4 - M 03-Mar-2023 BUY 205.6830 10.8500 GBP 2,231.6606 1.0000 2,253.7800
AA2345678 Another random page up company - I 03-Mar-2023 BUY 66.3850 45.4400 GBP 3,016.5344 1.0000 3,038.6500
AA2345679 Company 5 Utilities- M 03-Mar-2023 BUY 76.7370 13.7400 GBP 1,054.3664 1.0000 1,076.4900

The third row in this case contains two \n instances: Class\nAA-B Income\n

My pattern is as follows:

(?<=\\n).*?([a-zA-Z]{3})-(\d{4})

https://regex101.com/r/aiDk9G/1
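
For reference, the behaviour described above can be reproduced outside regex101. The sketch below uses Python, which is an assumption since the question does not name a language, and a shortened, hypothetical sample in the same shape as the real data. It suggests why the match starts too early: the engine tries start positions from left to right, so the lazy `.*?` only shortens the end of the match, never its start.

```python
import re

# Shortened, hypothetical sample. Raw strings keep "\n" as two literal
# characters (backslash + n), as in the question's data.
sample = (
    r"\nGBP*\nAA1234567 A random company name - I 03-Mar-2023 BUY 1.0"
    r"\nASSET SCHEDULE\nPolicy Number 1234-56789"
    r"\nAA2345679 Company 5 Utilities- M 03-Mar-2023 BUY 2.0\nTotal"
)

# The pattern from the question, unchanged.
original = re.compile(r"(?<=\\n).*?([a-zA-Z]{3})-(\d{4})")

# group(0) is the whole match; the two capture groups are ignored here.
found = [m.group(0) for m in original.finditer(sample)]

for item in found:
    print(item)
```

Each match begins at the earliest available \n rather than the one closest to the date, so the "GBP*" and "ASSET SCHEDULE" fragments are pulled in.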

If there's an easier way, please let me know.

Thanks in advance.

I've tried multiple patterns but can't quite get it.


Solution

  • You may use this regex:

    (?<=\\n)(?:[A-Z]+[0-9][A-Z0-9]*|-)(?:\s+\w+)+.*?[a-zA-Z]{3}-\d{4}.+?(?=\\n)
    

    RegEx Demo

    RegEx Details:

    • (?<=\\n): Lookbehind to assert presence of \n at the previous position
    • (?:: Start non-capture group
      • [A-Z]+: Match 1+ of uppercase letters
      • [0-9] : Match a digit
      • [A-Z0-9]*: Match 0 or more uppercase letters or digits
      • | OR
      • -: Match a -
    • ): End non-capture group
    • (?:\s+\w+)+: Match 1+ words of the company name, each preceded by 1+ whitespace characters
    • .*?: Match 0+ of any character (non-greedy)
    • [a-zA-Z]{3}-\d{4}: Match month-year substring
    • .+?: Match 1+ of any character (non-greedy)
    • (?=\\n): Lookahead to assert presence of \n at the next position
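
As a rough illustration of how this pattern could be applied end to end, here is a minimal Python sketch. Python is an assumption (the question does not say which language is in use), and the names `text`, `ROW_PATTERN` and `rows` are made up for the example. It runs the regex over the sample string from the question, then replaces the remaining \n sequences inside each matched row with a space.

```python
import re

# Sample text from the question. Raw strings keep "\n" as two literal
# characters (backslash + n), which is what \\n in the pattern matches.
text = (
    r"\nGBP*\nAA1234567 A random company name - I 03-Mar-2023 BUY 42.6400 42.6900 GBP 1,820.3016 1.0000 1,842.4400"
    r"\nAA1234568 Another randon company name - H-M 03-Mar-2023 BUY 11.9880 845.6000 GBP 10,137.0528 1.0000 10,159.1700"
    r"\nAA12345679 Third Party Utilies - Fund - Class\nAA-B Income\n03-Mar-2023 BUY 6.4120 836.9100 GBP 5,366.2669 1.0000 5,388.5200"
    r"\nAA12345670 Company 4 - M 03-Mar-2023 BUY 205.6830 10.8500 GBP 2,231.6606 1.0000 2,253.7800"
    r"\nAA2345678 Another random page up company - I 03-Mar-2023 BUY 66.3850 45.4400 GBP 3,016.5344 1.0000 3,038.6500"
    r"\nASSET SCHEDULE\nPolicy Number 1234-56789"
    r"\nAA2345679 Company 5 Utilities- M 03-Mar-2023 BUY 76.7370 13.7400 GBP 1,054.3664 1.0000 1,076.4900\nTotal"
)

# The regex from the answer, unchanged.
ROW_PATTERN = re.compile(
    r"(?<=\\n)(?:[A-Z]+[0-9][A-Z0-9]*|-)(?:\s+\w+)+.*?[a-zA-Z]{3}-\d{4}.+?(?=\\n)"
)

# The pattern has no capture groups, so findall returns the full matches.
# Any literal \n left inside a row is then replaced with a space.
rows = [match.replace("\\n", " ") for match in ROW_PATTERN.findall(text)]

for row in rows:
    print(row)
```

With the sample above this yields six rows. The "GBP*" and "ASSET SCHEDULE" fragments are skipped because they do not start with an identifier made of uppercase letters followed by a digit, and the third row comes back on a single line.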