Extract the required text using a regular expression in Notepad++

I have about 10,000 characters of XML text and I need to pull out each variable name and the value next to it.

Example text:

 <? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>2019-01-01T10:41:18- 
 05:00</xyzefg><**shAMount**>8000.00</afsfda;sfkj;alkfl;kaf>
 <? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>2019-02-01T10:41:18- 
 05:00</xyzefg><**shAMount**>7000.00</afsfda;sfkj;alkfl;kaf>

The text above contains data for two variables: ToDateTimestamp and shAmount.

Want:

ToDateTimestamp 2019-01-01T10:41:18-05:00
ToDateTimestamp 2019-02-01T10:41:18-05:00
shAmount 8000.00
shAmount 7000.00

I tried recording a macro in Notepad++ that finds a particular piece of text and repeats for all the records, but "Run a Macro Multiple Times" is not working. Is there a regex approach that clears everything else and keeps only the values next to the variable names I mentioned? I can repeat the step for each variable separately.

Thanks for your help

Solution

While you should consider parsing the HTML/XML properly, I'm always dipping into Notepad++ to clean up data myself. You may need a few goes at this, but here is something that may help...

https://regex101.com/r/uAPi97/1

Now the above is pretty much based on getting all the lines of...

<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>2019-01-01T10:41:18-05:00</xyzefg><**shAMount**>8000.00</afsfda;sfkj;alkfl;kaf>

...onto one line each. So switch off word wrap and check that they are. If they aren't, you may need to find (in 'Extended' search mode) the...

<?

...bit and replace with...

\r<?

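...as an example. Then you can (perhaps) use a regex that matches the entire line (it has to match the whole line, so that everything you don't want is replaced too) and captures the bits of interest (the parts wrapped in () are captured). Then do a find and replace in Notepad++ with 'Regular expression' search mode selected and '. matches newline' unticked, so that .* stops at the end of each line.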

The regex...

^.*?(\d+-\d+-\d+T\d+:\d+:\d+-\d+:\d+).*(shAMount).*?(\d+\.\d+).*$

...finds the whole line and if you replace with...

$1$2$3

...then the three bits in () from the regex are put back. So this...

<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>2019-01-01T10:41:18-05:00</xyzefg><**shAMount**>8000.00</afsfda;sfkj;alkfl;kaf>
<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>2019-02-01T10:41:18-05:00</xyzefg><**shAMount**>7000.00</afsfda;sfkj;alkfl;kaf>
<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>2019-02-01T10:41:18-05:00</xyzefg><**shAMount**>7000.00</afsfda;sfkj;alkfl;kaf>
<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>2019-02-01T10:41:18-05:00</xyzefg><**shAMount**>7000.00</afsfda;sfkj;alkfl;kaf>
<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>2019-02-01T10:41:18-05:00</xyzefg><**shAMount**>7000.00</afsfda;sfkj;alkfl;kaf>
<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>2019-02-01T10:41:18-05:00</xyzefg><**shAMount**>7000.00</afsfda;sfkj;alkfl;kaf>

...becomes this...

2019-01-01T10:41:18-05:00shAMount8000.00
2019-02-01T10:41:18-05:00shAMount7000.00
2019-02-01T10:41:18-05:00shAMount7000.00
2019-02-01T10:41:18-05:00shAMount7000.00
2019-02-01T10:41:18-05:00shAMount7000.00
2019-02-01T10:41:18-05:00shAMount7000.00
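
If you want to check the pattern outside Notepad++, the same find and replace can be sketched in Python. This is only an illustration: the names LINE_RE, sample and result are made up here, and Python's re module writes the replacement as \1\2\3 where Notepad++ uses $1$2$3.

```python
import re

# Same pattern as above. re.MULTILINE makes ^ and $ match at every line,
# which is how Notepad++ treats them in 'Regular expression' mode.
LINE_RE = re.compile(
    r"^.*?(\d+-\d+-\d+T\d+:\d+:\d+-\d+:\d+).*(shAMount).*?(\d+\.\d+).*$",
    re.MULTILINE,
)

# Two records from the question, one per line.
sample = (
    '<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>'
    "2019-01-01T10:41:18-05:00</xyzefg><**shAMount**>8000.00</afsfda;sfkj;alkfl;kaf>\n"
    '<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>'
    "2019-02-01T10:41:18-05:00</xyzefg><**shAMount**>7000.00</afsfda;sfkj;alkfl;kaf>"
)

# Replace each whole line with just the three captured groups.
result = LINE_RE.sub(r"\1\2\3", sample)
print(result)
# 2019-01-01T10:41:18-05:00shAMount8000.00
# 2019-02-01T10:41:18-05:00shAMount7000.00
```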

That may not be 100% what you want, but from there you can clean it up a bit more: for example, find (in 'Extended' search mode) 'shAMount' (without quotes) and replace it with '\rshAMount' (without quotes). A few rounds of find and replace and you should be closer to your goal.

But yes, if you do this a lot, check out Python and its HTML parser (html.parser in the standard library). There is more to learn, but it is quite powerful.
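
As a rough idea of what the Python route could look like, here is a minimal sketch that produces the "name value" lines asked for in the question. It uses the re module rather than a real parser, because the sample as posted is not well-formed XML (the closing tags don't match the opening ones). The function name extract_values is made up for this example, and two things are assumptions: that the ** around the tag names is just bold markup from the post, and that the values never contain meaningful whitespace, so a value wrapped across two lines can simply be glued back together.

```python
import re


def extract_values(text, names):
    """Return 'name value' lines, grouped by variable name in the order given."""
    lines = []
    for name in names:
        # The ** around the name is optional (assumed to be bold markup).
        pattern = re.compile(
            r"<(?:\*\*)?" + re.escape(name) + r"(?:\*\*)?>([^<]*)<"
        )
        for raw in pattern.findall(text):
            # Assumption: whitespace inside a value only comes from line
            # wrapping, so drop all of it.
            lines.append(f"{name} {''.join(raw.split())}")
    return lines


# The two records from the question, including the wrapped timestamps.
sample = (
    '<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>'
    "2019-01-01T10:41:18- \n 05:00</xyzefg><**shAMount**>8000.00</afsfda;sfkj;alkfl;kaf>\n"
    '<? xml version="1.0" of encoding="UTF-8"?><abcdefghij><**ToDateTimestamp**>'
    "2019-02-01T10:41:18- \n 05:00</xyzefg><**shAMount**>7000.00</afsfda;sfkj;alkfl;kaf>\n"
)

print("\n".join(extract_values(sample, ["ToDateTimestamp", "shAMount"])))
# ToDateTimestamp 2019-01-01T10:41:18-05:00
# ToDateTimestamp 2019-02-01T10:41:18-05:00
# shAMount 8000.00
# shAMount 7000.00
```

Because the whitespace is stripped inside the function, this version does not need the records to be put on one line each first.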