Regex to repair malformed XML attributes, removing spaces in tags


I've been unfortunate enough to come across a lot of malformed XML.
I cannot get the regex right to remove the two spaces inside the tag name (the attribute/key).

My current regex also checks to see if there is a valid "=" attribute.

An XML tag has to have just one key (its name), and any attribute has to have a value.

For example:

<ImValid></ImValid>
<Im not Valid></Im not Valid>
<ImValid with="somthing"></ImValid>

Here is my malformed XML:

<Addresses>
  <Address>
    <Delivery id>123123</Delivery id>
    <Delivery Code Id>123123</Delivery Code Id>
    <Full Name>Agent Smith</Full Name>
  </Address>
  <Address>
    <Delivery id>12322123</Delivery id>
    <Delivery Code Id>12zz3123</Delivery Code Id>
    <Full Name>Mr Anderson</Full Name>
  </Address>
</Addresses>

I'm trying to repair it using regex.

AstringVar => Regex.Replace(AstringVar , @"(?=<[^=]+?>)(?=</?\w+\s+\w+)(<.*?)(\s+)(.*?>)", @"$1$3", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase)

This will change this:

<Full Name>Mr Anderson</Full Name>

to this:

<FullName>Mr Anderson</FullName>

That is an improvement, but it still misses the last space when there are two:

<DeliveryCode Id>12zz3123</DeliveryCode Id>

OK, I could run it twice, but that seems ugly. How can I match both one space and two spaces while still leaving the values untouched? Thanks to any regex heroes who can help!

Here it is on regex101: https://regex101.com/r/dVs51I/3


Solution

  • Looking at your pattern, you want to:

    • (?=<[^=]+?>) Make sure that there is no = between <...>
    • (?=</?\w+\s+\w+) Make sure that < or </ is followed by one or more word chars, at least one whitespace char, and then another word char
    • (<.*?)(\s+)(.*?>) Match 1 or more whitespace chars between <...>, capturing the text before and after them in groups 1 and 3

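    The issue here is that (<.*?)(\s+)(.*?>) matches only once per tag. The match consumes the whole tag from < to >, so only the first run of whitespace ends up in group 2 and gets removed.

    Also, for <test ></test > the trailing space will not be matched, because the second lookahead requires another word char after the whitespace and there is only a single word.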
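To see the single-match behaviour concretely, here is a minimal sketch that runs the pattern from the question over two of the sample inputs. It assumes C# 9 or later (top-level statements); the helper name ReplaceOnce is made up for illustration and is not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern copied verbatim from the question.
const string QuestionPattern = @"(?=<[^=]+?>)(?=</?\w+\s+\w+)(<.*?)(\s+)(.*?>)";

// Hypothetical helper: a single Regex.Replace pass, with the question's options.
string ReplaceOnce(string input) =>
    Regex.Replace(input, QuestionPattern, "$1$3",
        RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

// Only the first whitespace run in each tag is removed.
string once = ReplaceOnce("<Delivery Code Id>12zz3123</Delivery Code Id>");
Console.WriteLine(once);   // <DeliveryCode Id>12zz3123</DeliveryCode Id>

// A second pass removes the remaining space (the "run it twice" workaround).
string twice = ReplaceOnce(once);
Console.WriteLine(twice);  // <DeliveryCodeId>12zz3123</DeliveryCodeId>

// A trailing space after a single word is never matched.
Console.WriteLine(ReplaceOnce("<test ></test >"));  // <test ></test >
```

Each call finds one match per tag because the three capture groups together span the entire tag, leaving nothing for a second match to start from.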


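    Note that the pattern below is written for the given examples and is not foolproof against everything that XML allows.

    In C# you can use an unbounded quantifier inside a lookbehind assertion, so that every run of whitespace inside a tag becomes its own match. Replace each match with an empty string.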

    (?<=</?\s*\w[^<>=]*)\s+(?=[^=<>]*>)
    

    The pattern matches:

    • (?<= Positive lookbehind, assert that to the left is
      • </? Match either < or </
      • \s*\w Match optional whitespace chars followed by a single word char
      • [^<>=]* Optionally repeat matching any character except < > =
    • ) Close the lookbehind assertion
    • \s+ Match 1 or more whitespace characters
    • (?= Positive lookahead, assert that to the right is
      • [^=<>]* Optionally repeat matching any character except = < >
      • > Match >
    • ) Close the lookahead assertion

    See a C# regex demo.
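
As a rough sketch of how the pattern above might be applied, the following runs it over a few of the inputs from the question, replacing each match with an empty string. It assumes C# 9 or later (top-level statements); the helper name RemoveSpacesInTagNames is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer: whitespace inside <...> when the tag contains no '='.
const string TagSpacePattern = @"(?<=</?\s*\w[^<>=]*)\s+(?=[^=<>]*>)";

// Hypothetical helper: removes every matched whitespace run in one pass.
string RemoveSpacesInTagNames(string xml) =>
    Regex.Replace(xml, TagSpacePattern, "");

string[] samples =
{
    "<Delivery Code Id>12zz3123</Delivery Code Id>",
    "<Full Name>Mr Anderson</Full Name>",
    "<ImValid with=\"somthing\"></ImValid>",
    "<test ></test >",
};

foreach (string sample in samples)
{
    Console.WriteLine(RemoveSpacesInTagNames(sample));
}
// <DeliveryCodeId>12zz3123</DeliveryCodeId>
// <FullName>Mr Anderson</FullName>
// <ImValid with="somthing"></ImValid>
// <test></test>
```

Because both lookarounds are zero-width, the match itself is only the whitespace, so one Regex.Replace call handles any number of spaces in a tag. Element values such as "Mr Anderson" are left alone because the lookbehind cannot cross a > character, and tags that contain = are skipped entirely. Input where < or > appears inside text content, attribute values, comments, or CDATA sections would likely need a different approach.") throw new Exception("trailing space should be removed");
</test>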