I'm currently parsing data from PDFs and I'd like to get the name and amount in a simple format: [NAME] [AMOUNT]
NAME LAST
7 494 25 7 494 25 199 44
NAME LAST
4 488 00 4 488 00 109 07
NAME MIDDLE LAST
7 854 00 7 854 00 298 25
NAME LAST
494 23 494 23 12 01
NAME MIDDLE LAST
4 301 56 4 301 56 112 61
NAME M LAST
13 359 25 13 359 25 130 54
This data means the following:
[NAME] [M?] [LAST]
[TOTAL WAGES] [PIT WAGES] [PIT WITHHELD]
NAME LAST $7,494.25 $7,494.25 $199.44
NAME LAST $4,488.00 $4,488.00 $109.07
NAME MIDDLE LAST $7,854.00 $7,854.00 $298.25
NAME LAST $494.23 $494.23 $12.01
NAME MIDDLE LAST $4,301.56 $4,301.56 $112.61
NAME M LAST $13,359.25 $13,359.25 $130.54
I'd like a regex to detect the duplicate group of numbers so that it parses to this:
NAME LAST $7,494.25
NAME LAST $4,488.00
NAME MIDDLE LAST $7,854.00
NAME LAST $494.23
NAME MIDDLE LAST $4,301.56
NAME M LAST $13,359.25
Hopefully, that makes sense. Thanks
Assuming that no-one in your organisation is making more than $1M or less than $1, this regex will do what you want:
*([a-z][a-z ]+)\R+((\d+)(?: (\d+))? (\d+)) (?=\2).*
It looks for
[a-z][a-z ]+
(captured in group 1)\R+
)((\d+)(?: (\d+))? (\d+))
(captured overall in group 2, with individual groups of digits captured in groups 3, 4 and 5) (?=\2)
.*
)You can replace that with
$1 \$$3$4.$5
to get the following output for your sample data:
NAME LAST $7494.25
NAME LAST $4488.00
NAME MIDDLE LAST $7854.00
NAME LAST $494.23
NAME MIDDLE LAST $4301.56
NAME M LAST $13359.25
If you're using JavaScript, you need a couple of minor changes. In the regex, replace \R
with [\r\n]
as JavaScript doesn't recognise \R
. In the substitution, replace \$
with $$
.
If your regex flavour supports conditional replacements, you can add a ,
between the thousands and hundreds by checking if group 4 was part of the match:
$1 \$$3${4:+,}$4.$5
In this case the output is:
NAME LAST $7,494.25
NAME LAST $4,488.00
NAME MIDDLE LAST $7,854.00
NAME LAST $494.23
NAME MIDDLE LAST $4,301.56
NAME M LAST $13,359.25