I need a Perl regular expression to select content only if the content is missing either the beginning quote or the ending quote. The beginning quote will always be preceded by an equals sign =. The ending quote can be followed by a space, more text or a carriage return. One line can contain many attributes (quote pairs) to check.
I tried (?<!")(.*?)"
but that was a disaster. I thought maybe I could use a simple regex: find the equals sign, look at the next character and check whether it is a quote followed by text and an end quote. If the quote is missing at the beginning or at the end of the text, add it.
Things to note: the text between the quotes will always be character data. There will be no symbols or spaces.
<table pgwide="0" id="dvr_config_firmware>
<title>DFR Firmware</title>
<tgroup cols="2">
<colspec colname="col1>
<colspec colname="col2">
The question was initially about correcting the invalid XML block
<table pgwide="0" id="dvr_config_firmware>
<title>DFR Firmware</title>
<tgroup cols="2">
<colspec colname="col1>
<colspec colname="col2">
to the valid XML block
<table pgwide="0" id="dvr_config_firmware">
<title>DFR Firmware</title>
<tgroup cols="2">
<colspec colname="col1">
<colspec colname="col2">
UltraEdit for Windows version 28.20.0.70 and UEStudio version 21.10.0.24 are currently the latest versions. Both use the Perl regular expression engine of the Boost library.
The Perl compatible regular search expression suggested in sln’s answer is:
=(?|(")([^"<>\s]*)()(?=[\s>]|\/>)|(?!")()([^"<>\s]*)("))
It produces the correct result with UE v28.20.0.70 and UES v21.10.0.24 and with some earlier, not too old versions when ="$2" is used as the replace string.
The Python compatible variant suggested next by sln also works well with the currently latest and with former versions of UE/UES for the sample data. It uses the search expression
=(?:(")([^"<>\s]*)()(?=[\s>]|\/>)|(?!")()([^"<>\s]*)("))
and ="\2\5" as the replace string.
JennyP then wrote in a comment that the XML file could also contain an attribute value with spaces and a missing end quote, as in this XML sample block:
<table pgwide="0" id="dvr_config_firmware>
<title>DFR Firmware</title>
<tgroup cols="2">
<colspec colname="col1>
<colspec colname="col2">
<info date="09 JAN 2000 version="1.0">
The expected result is now:
<table pgwide="0" id="dvr_config_firmware">
<title>DFR Firmware</title>
<tgroup cols="2">
<colspec colname="col1">
<colspec colname="col2">
<info date="09 JAN 2000" version="1.0">
But the first two regular expressions in Perl syntax suggested by sln produce:
<table pgwide="0" id="dvr_config_firmware">
<title>DFR Firmware</title>
<tgroup cols="2">
<colspec colname="col1">
<colspec colname="col2">
<info date="09" JAN 2000 version="1.0">
Only the day of the date is enclosed in quotes instead of the entire date, because the two regular expressions are not designed for attribute values with one or more spaces. That was not a requirement initially.
The Perl compatible solution suggested by sln for this case uses the search expression
=(?|(")((?:(?![a-z]*=)[^"<>])*)()(?=[\s>]|/>)|(?!")()((?:(?![a-z]*=)[^"<>])*)("))
and ="\2"
as the replace string, which produces the expected result when executed with UE/UES.
The result is also correct with the Python compatible search expression
=(?:(")((?:(?![a-z]*=)[^"<>])*)()(?=[\s>]|/>)|(?!")()((?:(?![a-z]*=)[^"<>])*)("))
and ="\2\5" as the replace string.
@sln, well done!
In the meantime, the same task was also discussed in the UltraEdit forum topic Regular expression to search for attributes with missing a quote.
I posted a reply in the UltraEdit forum with an even more broken XML block:
<table pgwide=0" id="dvr_config_firmware>
<title>DFR Firmware</title>
<tgroup cols="3">
<colspec colname=col1>
<colspec colname="col2">
<colspec colname="col3 attrib="xyz">
<applicdef verdate="18 Jan 2019 verstatus="ver">
The value of the first attribute pgwide is missing the beginning quote. The attribute value col1 is not enclosed in quotes at all. The attribute value col3 is missing the end quote and is followed by one more attribute. The same applies to the last XML element, whose first attribute value contains spaces and is missing the end quote.
The expected XML block is:
<table pgwide="0" id="dvr_config_firmware">
<title>DFR Firmware</title>
<tgroup cols="3">
<colspec colname="col1">
<colspec colname="col2">
<colspec colname="col3" attrib="xyz">
<applicdef verdate="18 Jan 2019" verstatus="ver">
The Perl and the Python compatible expressions suggested by sln in the second part of this answer do a good job of adding the quotes to those attribute values on which just one quote is missing, either at the beginning or at the end. But the attribute value col1 remains without quotes. Enclosing it was of course not a requirement of the original task.
I suggested using two Perl compatible regular expression replaces to get the expected result. The first one searches with
\w=\K([^"=>]+)(?=>)
and uses "$1"
or "\1"
as the replace string to enclose in quotes those attribute values on which both quotes are missing, like col1.
The second one searches with
\w=\K(?:(?!")|"[^">]*\K(?=>)|"[^ >"]++(?= \w+=)\K|"(?:[^ >"]++(?![>"])(?! \w+=) )+[^ ">]+\K)
and uses just a single " as the replace string to insert the missing quote at the beginning or at the end of an attribute value on which just one quote is missing.
The UltraEdit forum member Fleggy posted another solution, which also works in Notepad++. It uses a conditional Perl compatible regular expression with the search string
\w=\K(")?([\w ]+)(?(1)(?!")|"?)(?!\w*[="])
and "\2" as the replace string.
All the regular expressions written above have one problem:
They can also modify attribute values which are already correctly enclosed in quotes.
Example: The XML block is already:
<table pgwide="0" id="dvr_config_firmware">
<title>DFR Firmware</title>
<tgroup cols="3">
<colspec colname="col1">
<colspec colname="col2">
<colspec colname="col3" attrib="xyz">
<applicdef verdate="18 Jan 2019" verstatus="ver">
Running the regular expression replace should not modify this block at all. But none of the regular expression replaces above leaves this block unchanged. The attribute value 18 Jan 2019 causes the insertion of one more ", which makes the XML block invalid for XML parsers.
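The problem can be reproduced with the Python compatible expression for values with spaces quoted earlier in this answer. The following is a minimal sketch using the re module of the Python standard library. The variable names are only illustrative, and Python 3.5 or later is assumed.

```python
import re

# Python compatible search expression for values with spaces, quoted earlier.
SEARCH = re.compile(
    r'=(?:(")((?:(?![a-z]*=)[^"<>])*)()(?=[\s>]|/>)|(?!")()((?:(?![a-z]*=)[^"<>])*)("))'
)
REPLACE = r'="\2\5"'

# This line is already valid and should stay untouched.
valid_line = '<applicdef verdate="18 Jan 2019" verstatus="ver">'
result = SEARCH.sub(REPLACE, valid_line)

print(result)
# False: the valid line has been modified.
print(result == valid_line)
```

The value cannot end at the existing closing quote, because the lookahead (?=[\s>]|/>) requires whitespace, > or /> after the value. The expression therefore backtracks until the value ends in front of a space, which is the case after 18 Jan, and one more quote is inserted there.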
But thanks to Fleggy, there is also a solution that safely adds missing quotes to attribute values which can contain spaces and on which a quote is missing at the beginning, at the end or on both sides, while correctly quoted attribute values are not modified in any way.
The ultimate Perl compatible regular expression search string for this task is:
\w=\K(")?([\w ]+)(?(1)(?(?=")(*SKIP)(*FAIL))|"?)(?!\w*=)
The replace expression string is: "\2"
Thank you, Fleggy.