regex, regex-group, regex-lookarounds

Regex Query (Match the 2nd instance prior to a word)


I have a string retrieved from a PDF file. It contains an unstructured table which I have successfully formatted.

However, I need to strip out a particular line. I won't know exactly what it says, but I do know specific words that it contains.

This is an example of the string

ASSET SCHEDULE\nPolicy Number 1234-567890\nASSET SCHEDULE\nPage 1 of 4ASSET SCHEDULE\nPolicy Number 1234-567890\nDeal id: 00030 Policy Number: 1234-567890\nDate: 28-Feb-2023 Policy Currency: GBP\nSell transactions\nISIN Asset name Transaction date Deal \ntype\nQuantity Price Asset \nCCY\nAsset\nvalue\nExchange rate Policy value \nGBP*\nAA00000012 Company 1 01-Mar-2023 SELL 811.7040 13.7556 GBP 11,165.4755 1.0000 11,143.2600\nAA00000013 Company 2 01-Mar-2023 SELL 220.0600 10.8200 GBP 2,381.0492 1.0000 2,358.8300\nAA00000014 Company 3 01-Mar-2023 SELL 16.1250 135.0200 GBP 2,177.1975 1.0000 2,154.9200\nAA00000015 Company 4 01-Mar-2023 SELL 4.3420 1,520.7085 GBP 6,602.9163 1.0000 6,580.1600\nAA00000016 Company 5 01-Mar-2023 SELL 0.7320 1,878.1622 GBP 1,374.8147 1.0000 1,352.2900\nAA00000017 Company 6 01-Mar-2023 SELL 106.3298 118.3700 GBP 12,586.2584 1.0000 12,564.0400\nAA00000018 Company 7 01-Mar-2023 SELL 109.5476 101.7800 GBP 11,149.7547 1.0000 11,127.5300\nTotal GBP : 47,281.03\nBuy transactions\nISIN Asset name Transaction date Deal \ntype\nQuantity Price Asset \nCCY\nAsset\nvalue\nExchange rate Policy value \nGBP*\nAA00000019 Company 8 03-Mar-2023 BUY 42.6400 42.6900 GBP 1,820.3016 1.0000 1,842.4400\nAA00000020 Company 9 03-Mar-2023 BUY 11.9880 845.6000 GBP 10,137.0528 1.0000 10,159.1700\nAA00000021 Company 10 03-Mar-2023 BUY 6.4120 836.9100 GBP 5,366.2669 1.0000 5,388.5200\nAA00000022 Company 11 03-Mar-2023 BUY 205.6830 10.8500 GBP 2,231.6606 1.0000 2,253.7800\nAA00000023 Company 12 03-Mar-2023 BUY 66.3850 45.4400 GBP 3,016.5344 1.0000 3,038.6500\nP99820\AR/03020/ZZ\n* Policy value amounts may include a transaction charge\nASSET SCHEDULE\nPage 2 of 4ASSET SCHEDULE\nPolicy Number 1234-567890\nAA00000024 Company 13 03-Mar-2023 BUY 76.7370 13.7400 GBP 1,054.3664 1.0000 1,076.4900\nTotal GBP : 23,759.05\nP99820\AR/03020/ZZ\n* Policy value amounts may include a transaction charge\nASSET SCHEDULE\nPage 3 of 4ASSET SCHEDULE\nPolicy Number 1234-567890\nDISCLAIMER\nThis 
document was produced by Once CLS S.A. (“Once CLS”) in March 2023. Its\ncontent is intended for informational purposes only and is not to be construed as a solicitation or an offer\nto buy or sell any life assurance product. Neither is the information contained herein intended to\nconstitute any form of legal, fiscal or investment advice. It should therefore only be used in conjunction\nwith appropriate independent professional advice obtained from a suitable and qualified source.\nWhilst every care has been taken in producing this document, no representation or warranty, whether\nexpress or implied, is made in relation to the accuracy, completeness or reliability of the information\ncontained herein, except with respect to information concerning Once CLS or its group companies.\nAll copyright in this material belongs to Once CLS.\nCopyright © 2023 Once CLS S.A.\nOnce CLS S.A.\nASSET SCHEDULE\nPage 4 of 4

I want to match from the 2nd \n before Policy value up to and including the \n after Policy Number 1234-567890

so the output for the match would be the following.

\nP99820\AR/03020/ZZ\n* Policy value amounts may include a transaction charge\nASSET SCHEDULE\nPage 2 of 4ASSET SCHEDULE\nPolicy Number 1234-567890\n

(00\\n)(.*)(?=Policy value).*(\\n) starts matching at the first 00\n in the string, but I need the match to start at the closest 00\n before Policy value.
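To illustrate the problem, here is a minimal Python sketch. The sample text is a made-up, shortened stand-in for the real string (an assumption for illustration only), written as a raw string so that \n stays as a literal backslash followed by n. It shows that a regex engine returns the leftmost match, so the attempt above always begins at the first 00\n rather than the one nearest to Policy value.

```python
import re

# Hypothetical, shortened sample; the raw string keeps "\n" as backslash + n.
text = r"1.00\nrow two 2.00\nFOOTER\n* Policy value note\nPolicy Number 1-2\nrest"

attempt = re.compile(r"(00\\n)(.*)(?=Policy value).*(\\n)")
m = attempt.search(text)

# Where the attempt actually starts: the first "00\n" in the string.
first = text.index(r"00\n")

# Where the match should start: the last "00\n" before "Policy value".
wanted = text.rindex(r"00\n", 0, text.index("Policy value"))

print(m.start(), first, wanted)
```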

I hope that makes sense.

I have tried a few different approaches but can't seem to get it to find just the 2nd \n before the match of Policy value.


Solution

  • You may use this regex and grab capture group #1:

    00\\n.*?00(\\n(?:(?!00\\n).)*?Policy Number\s+[0-9-]+)\\n
    

    RegEx Demo

    RegEx Details:

    • 00: Match 1st 00
    • \\n: Match \n
    • .*?: Match 0 or more of any characters, lazily (as few as possible)
    • 00: Match 2nd 00
    • (: Start capture group #1
      • \\n: Match \n
      • (?:(?!00\\n).)*?: Lazily match 0 or more characters, each one only if 00\n does not start at that position
      • Policy Number: Match Policy Number
      • \s+: Match 1+ whitespaces
      • [0-9-]+: Match 1+ of hyphen or digits
    • ): End capture group #1
    • \\n: Match \n
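As a rough illustration of the breakdown above, here is a minimal Python sketch. It assumes the text holds \n as two literal characters (backslash and n), as in the question, and uses a shortened excerpt of the sample string. The names `text`, `footer`, and `cleaned` are illustrative only.

```python
import re

# Shortened excerpt of the sample. Raw strings keep "\n" as backslash + n,
# which is what `\\n` in the pattern matches.
text = (
    r"AA00000022 Company 11 03-Mar-2023 BUY 205.6830 10.8500 GBP 2,231.6606 1.0000 2,253.7800\n"
    r"AA00000023 Company 12 03-Mar-2023 BUY 66.3850 45.4400 GBP 3,016.5344 1.0000 3,038.6500\n"
    r"P99820\AR/03020/ZZ\n"
    r"* Policy value amounts may include a transaction charge\n"
    r"ASSET SCHEDULE\nPage 2 of 4ASSET SCHEDULE\n"
    r"Policy Number 1234-567890\n"
    r"AA00000024 Company 13 03-Mar-2023 BUY 76.7370 13.7400 GBP 1,054.3664 1.0000 1,076.4900\n"
)

pattern = re.compile(r"00\\n.*?00(\\n(?:(?!00\\n).)*?Policy Number\s+[0-9-]+)\\n")

match = pattern.search(text)
footer = match.group(1) if match else None
print(footer)

# Strip only the captured span, leaving the surrounding rows untouched.
cleaned = text[:match.start(1)] + text[match.end(1):] if match else text
print(cleaned)
```

Capture group #1 starts at the \n before the footer line and stops just before the final \n, so removing it leaves a single \n between the surrounding rows. The pattern relies on the row before the footer ending in 00. In the sample from the question, the footer on page 3 follows Total GBP : 23,759.05\n, so this pattern does not appear to capture that second occurrence.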