Search code examples
regexregex-group

Regex, How to find a repeating group of random numbers?


I'm currently parsing data from PDFs and I'd like to get the name and amount in a simple format: [NAME] [AMOUNT]

 NAME LAST
7 494 25 7 494 25 199 44
 NAME LAST
4 488 00 4 488 00 109 07
 NAME MIDDLE LAST
7 854 00 7 854 00 298 25
 NAME LAST
494 23 494 23 12 01
 NAME MIDDLE LAST
4 301 56 4 301 56 112 61
 NAME M LAST
13 359 25 13 359 25 130 54

This data means the following:
[NAME] [M?] [LAST]
[TOTAL WAGES] [PIT WAGES] [PIT WITHHELD]
NAME LAST $7,494.25 $7,494.25 $199.44
NAME LAST $4,488.00 $4,488.00 $109.07
NAME MIDDLE LAST $7,854.00 $7,854.00 $298.25
NAME LAST $494.23 $494.23 $12.01
NAME MIDDLE LAST $4,301.56 $4,301.56 $112.61
NAME M LAST $13,359.25 $13,359.25 $130.54

I'd like a regex to detect the duplicate group of numbers so that it parses to this:
NAME LAST $7,494.25
NAME LAST $4,488.00
NAME MIDDLE LAST $7,854.00
NAME LAST $494.23
NAME MIDDLE LAST $4,301.56
NAME M LAST $13,359.25

Hopefully, that makes sense. Thanks


Solution

  • Assuming that no-one in your organisation is making more than $1M or less than $1, this regex will do what you want:

     *([a-z][a-z ]+)\R+((\d+)(?: (\d+))? (\d+)) (?=\2).*
    

    It looks for

    • some number of spaces
    • names (simplistically) with [a-z][a-z ]+ (captured in group 1)
    • newline characters (\R+)
    • 2 or 3 sets of digits separated by spaces ((\d+)(?: (\d+))? (\d+)) (captured overall in group 2, with individual groups of digits captured in groups 3, 4 and 5)
    • a space, followed by an assertion that group 2 is repeated (?=\2)
    • characters to match the rest of the string to end of line (may not be required, dependent on your application) (.*)

    You can replace that with

    $1 \$$3$4.$5
    

    to get the following output for your sample data:

    NAME LAST $7494.25
    NAME LAST $4488.00
    NAME MIDDLE LAST $7854.00
    NAME LAST $494.23
    NAME MIDDLE LAST $4301.56
    NAME M LAST $13359.25
    

    Demo on regex101

    If you're using JavaScript, you need a couple of minor changes. In the regex, replace \R with [\r\n] as JavaScript doesn't recognise \R. In the substitution, replace \$ with $$.

    Demo on regex 101

    If your regex flavour supports conditional replacements, you can add a , between the thousands and hundreds by checking if group 4 was part of the match:

    $1 \$$3${4:+,}$4.$5
    

    In this case the output is:

    NAME LAST $7,494.25
    NAME LAST $4,488.00
    NAME MIDDLE LAST $7,854.00
    NAME LAST $494.23
    NAME MIDDLE LAST $4,301.56
    NAME M LAST $13,359.25
    

    Demo on regex101