I've been unfortunate enough to come across a lot of malformed XML.
I cannot get the correct regex to remove two spaces inside the attribute/key.
My current regex also checks whether there is a valid "=" attribute.
A tag may have only one key, and any attribute has to have a value.
For example:
<ImValid></ImValid>
<Im not Valid></Im not Valid>
<ImValid with="somthing"></ImValid>
Here is my malformed XML:
<Addresses>
<Address>
<Delivery id>123123</Delivery id>
<Delivery Code Id>123123</Delivery Code Id>
<Full Name>Agent Smith</Full Name>
</Address>
<Address>
<Delivery id>12322123</Delivery id>
<Delivery Code Id>12zz3123</Delivery Code Id>
<Full Name>Mr Anderson</Full Name>
</Address>
<Addresses>
I'm trying to repair it using regex.
AstringVar => Regex.Replace(AstringVar , @"(?=<[^=]+?>)(?=</?\w+\s+\w+)(<.*?)(\s+)(.*?>)", @"$1$3", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase)
This will change this:
<Full Name>Mr Anderson</Full Name>
to this:
<FullName>Mr Anderson</FullName>
That is an improvement, but it still misses the last space:
<DeliveryCode Id>12zz3123</DeliveryCode Id>
OK, I could run it twice, but that seems ugly. How can I remove both one space and two spaces while also leaving the values untouched? Thanks to any regex heroes who can help!
Here it is on regex101: https://regex101.com/r/dVs51I/3
Looking at your pattern, you want to:
- `(?=<[^=]+?>)` Make sure that there is no `=` between `<...>`
- `(?=</?\w+\s+\w+)` Make sure that the first char is a word char after `<` or `</` and that there is at least a whitespace char and a second word character
- `(<.*?)(\s+)(.*?>)` Match 1 or more whitespace chars between `<...>`

The issue here is that `(<.*?)(\s+)(.*?>)` will have only a single match per tag.
Also, when you have `<test ></test >` you will not match the last space, as there is only a single word.
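The single-match behaviour can be reproduced with a small sketch. It wraps the pattern from the question in a helper; the class name `SinglePassDemo` and method name `ApplyOnce` are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglePassDemo
{
    // The pattern from the question, unchanged.
    private const string OriginalPattern =
        @"(?=<[^=]+?>)(?=</?\w+\s+\w+)(<.*?)(\s+)(.*?>)";

    // One Regex.Replace call: each tag is consumed by a single match,
    // so only the first whitespace run inside that tag is dropped.
    public static string ApplyOnce(string xml)
    {
        return Regex.Replace(xml, OriginalPattern, "$1$3",
            RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
    }
}
```

Calling `SinglePassDemo.ApplyOnce("<Delivery Code Id>12zz3123</Delivery Code Id>")` should give `<DeliveryCode Id>12zz3123</DeliveryCode Id>`, and `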
Note that this works for the given examples and is not foolproof against everything XML allows.
In C#, you can use an unbounded quantifier inside a lookbehind assertion to get multiple matches:
(?<=</?\s*\w[^<>=]*)\s+(?=[^=<>]*>)
The pattern matches:
- `(?<=` Positive lookbehind, assert that to the left is
  - `</?` Match either `<` or `</`
  - `\s*\w` Match optional whitespace chars followed by a single word char
  - `[^<>=]*` Optionally repeat matching any character except `<`, `>` or `=`
- `)` Close the lookbehind assertion
- `\s+` Match 1 or more whitespace characters
- `(?=` Positive lookahead, assert that to the right is
  - `[^=<>]*` Optionally repeat matching any character except `<`, `>` or `=`
  - `>` Match `>`
- `)` Close the lookahead assertion
See a C# regex demo.
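As a rough sketch of how the lookbehind pattern might be applied in code, the helper below replaces every match with an empty string in one pass. The names `TagNameRepair` and `RemoveSpaces` are hypothetical, and the sketch assumes the input looks like the examples in the question (no `<` or `>` inside text values).

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagNameRepair
{
    // Whitespace inside <...> or </...> when no '=' sits between the
    // whitespace and either end of the tag. The unbounded lookbehind is
    // supported by .NET but not by every regex engine.
    private static readonly Regex SpaceInTagName =
        new Regex(@"(?<=</?\s*\w[^<>=]*)\s+(?=[^=<>]*>)");

    // Removes every whitespace run inside attribute-free tags in one pass.
    // Each whitespace run is its own match, so tags with two or more
    // spaces are handled without running the replacement twice.
    public static string RemoveSpaces(string xml)
    {
        return SpaceInTagName.Replace(xml, string.Empty);
    }
}
```

With this, `TagNameRepair.RemoveSpaces("<Delivery Code Id>12zz3123</Delivery Code Id>")` should return `<DeliveryCodeId>12zz3123</DeliveryCodeId>`, while the space in a value such as `Mr Anderson` and a tag like `<ImValid with="somthing">` are left alone.") throw new Exception("trailing space in a tag should be removed");
if (TagNameRepair.RemoveSpaces("<Address>\n<Delivery id>123123</Delivery id>\n</Address>") != "<Address>\n<Deliveryid>123123</Deliveryid>\n</Address>") throw new Exception("newlines between tags must be preserved");
</test>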