c# regex unicode-string

Replacing special characters in files in the fastest way possible?


I have some files which contain special characters like é, ã, Δ, Ù, etc. I want to replace them with their four-digit hex NCR (numeric character reference) values. I've tried the method below, but I'm not sure whether it is the fastest possible way of achieving my goal...

var entities = new[]
{
    new { ser = "\u00E9", rep = @"&#x00E9;" },
    new { ser = "\u00E3", rep = @"&#x00E3;" },
    new { ser = "\u00EA", rep = @"&#x00EA;" },
    new { ser = "\u00E1", rep = @"&#x00E1;" },
    new { ser = "\u00C1", rep = @"&#x00C1;" },
    new { ser = "\u00C9", rep = @"&#x00C9;" },
    new { ser = "\u0394", rep = @"&#x0394;" },
    new { ser = "\u03B1", rep = @"&#x03B1;" },
    new { ser = "\u03B2", rep = @"&#x03B2;" },
    new { ser = "\u00B1", rep = @"&#x00B1;" },
//... so on
};

var files = Directory.GetFiles(path, "*.xml");
foreach (var file in files)
{
    string txt = File.ReadAllText(file);

    foreach (var entity in entities)
    {
        if (Regex.IsMatch(txt, entity.ser))
        {
            txt = Regex.Replace(txt, entity.ser, entity.rep);
        }
    }
    File.WriteAllText(file, txt);
}

Is there a faster, more efficient way of doing this?


Solution

  • From the comments, you want to replace the Unicode characters (e.g. Ù) with their numeric character references (e.g. &#x00D9;). A single Regex.Replace call is likely the best way to achieve this.

    Here is the loop for processing the files:

    var files = Directory.GetFiles(path, "*.xml");
    foreach (var file in files)
    {
        string txt = File.ReadAllText(file);
    
        string newTxt = Regex.Replace(
            txt,
            @"[^\u0000-\u007F]",
            HandleMatch);
    
        File.WriteAllText(file, newTxt);
    }
    

    And here is the match evaluator:

    // Requires: using System.Linq; (for Contains) and using System.Text.RegularExpressions;
    private static readonly char[] replacements = new[]
    {
        'ø',
        'Ù'
    };
    
    private static string HandleMatch(Match m)
    {
        // The pattern for the Regex will only match a single character, so get that character
        char c = m.Value[0];
    
        // Check if this is one of the characters we want to replace
        if (!replacements.Contains(c))
        {
            return m.Value;
        }
    
        // Convert the character to the 4 hex digit code
        string code = ((int) c).ToString("X4");
    
        // Format and return the code as a numeric character reference, e.g. &#x00D9;
        return "&#x" + code + ";";
    }
    

    In the loop, you only need to read each file once; Regex.Replace then handles every match in the input. The pattern matches any character outside the range 0x00 - 0x7F, that is, anything beyond the first 128 characters (the ASCII range).

    If you only need to replace specific Unicode characters, build a list of those characters (the replacements array above) and check the value of 'c' in the HandleMatch() function against that list; anything not in the list is returned unchanged.
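
    As a rough sketch of that filtering step, the list can be held in a HashSet<char> so the membership check stays cheap as the list grows. The class name NcrRegexSketch, the Targets set, and the Encode method are illustrative names, not part of the original answer, and the three characters in the set are only placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NcrRegexSketch
{
    // Hypothetical set of characters to replace (ø, Ù, é); extend as needed.
    private static readonly HashSet<char> Targets =
        new HashSet<char> { '\u00F8', '\u00D9', '\u00E9' };

    // Matches exactly one non-ASCII UTF-16 code unit per match.
    private static readonly Regex NonAscii =
        new Regex(@"[^\u0000-\u007F]", RegexOptions.Compiled);

    public static string Encode(string input)
    {
        return NonAscii.Replace(input, m =>
        {
            char c = m.Value[0];

            // Characters outside the target set are returned unchanged.
            if (!Targets.Contains(c))
            {
                return m.Value;
            }

            // Four-digit uppercase hex code wrapped as an NCR, e.g. &#x00E9;
            return "&#x" + ((int)c).ToString("X4") + ";";
        });
    }
}
```

    With this set, NcrRegexSketch.Encode("café Δ") returns "caf&#x00E9; Δ": é is in the set and is replaced, while Δ is not and passes through.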

    Comments on performance: You are trying to perform selective character replacement on a set of files. At a minimum, you are going to have to read each file into memory, then examine each character to see if it meets your criteria.

    A more performant option could be to build a lookup table that maps each character to its replacement string. The trade-off is that if you had a large list of characters to replace, the table would quickly become unwieldy to maintain. You also leave open the risk of errors in the replacement table, which would be more work to find.
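
    A minimal sketch of the lookup-table idea, assuming a Dictionary<char, string> and a single pass with a StringBuilder instead of a regex. The class name NcrTableSketch, the Table contents, and the Encode method are hypothetical, and whether this is actually faster on your files is something you would need to measure:

```csharp
using System.Collections.Generic;
using System.Text;

public static class NcrTableSketch
{
    // Hypothetical lookup table: character -> replacement string.
    private static readonly Dictionary<char, string> Table =
        new Dictionary<char, string>
        {
            { '\u00E9', "&#x00E9;" }, // é
            { '\u00E3', "&#x00E3;" }, // ã
            { '\u0394', "&#x0394;" }, // Δ
        };

    public static string Encode(string input)
    {
        var sb = new StringBuilder(input.Length);

        // Examine each character once; copy it through unless the table has an entry.
        foreach (char c in input)
        {
            string rep;
            if (Table.TryGetValue(c, out rep))
            {
                sb.Append(rep);
            }
            else
            {
                sb.Append(c);
            }
        }

        return sb.ToString();
    }
}
```

    For example, NcrTableSketch.Encode("aéΔb") returns "a&#x00E9;&#x0394;b". One way to reduce the risk of typos in the table would be to generate each value from its key with ((int)c).ToString("X4") rather than typing the strings by hand.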