C#: remove carriage returns, line breaks and whitespace from a string as efficiently as possible (benchmark)


In C# I have a string containing whitespace, carriage returns and/or line breaks. Is there a simple way to normalize large strings (100,000 to 1,000,000 characters) imported from text files as efficiently as possible?

To clarify what I mean: let's say my string looks like string1, but I want it to look like string2:

string1 = " ab c\r\n de.\nf";
string2 = "abcde.f";

Solution

  • What counts as "efficient" depends heavily on your actual strings and on how many of them you have. I came up with the following benchmark (for BenchmarkDotNet):

    using System.Text;
    using System.Text.RegularExpressions;
    using BenchmarkDotNet.Attributes;

    public class Replace
    {
        private static readonly string S = " ab c\r\n de.\nf";
        private static readonly Regex Reg = new Regex(@"\s+", RegexOptions.Compiled);
    
        [Benchmark]
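        // "\r" and "\n" are the actual CR and LF characters;
        // "\\r" would only match a literal backslash followed by 'r'.
        public string SimpleReplace() => S
           .Replace(" ", "")
           .Replace("\r", "")
           .Replace("\n", "");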
    
        [Benchmark]
        // Same three replacements, performed in place on a StringBuilder.
        public string StringBuilder() => new StringBuilder().Append(S)
           .Replace(" ", "")
           .Replace("\r", "")
           .Replace("\n", "")
           .ToString();
    
        [Benchmark]
        public string RegexReplace() => Reg.Replace(S, "");
    
        [Benchmark]
        public string NewString()
        {
            var arr = new char[S.Length];
            var cnt = 0;
            for (int i = 0; i < S.Length; i++)
            {
                switch(S[i])
                {
                    case ' ':
                    case '\r':
                    case '\n':
                        break;
    
                    default:
                        arr[cnt] = S[i];
                        cnt++;
                        break;
                }
            }
    
            return new string(arr, 0, cnt);
        }
    
        [Benchmark]
        public string NewStringForeach()
        {
            var validCharacters = new char[S.Length];
            var next = 0;
    
            foreach(var c in S)
            {
                switch(c)
                {
                    case ' ':
                    case '\r':
                    case '\n':
                        // Ignore them
                        break;
    
                    default:
                        validCharacters[next++] = c;
                        break;
                }
            }
    
            return new string(validCharacters, 0, next);
        }
    } 
    

    On my machine this gives:

    |          Method |        Mean |     Error |    StdDev |
    |---------------- |------------:|----------:|----------:|
    |   SimpleReplace |   122.09 ns |  1.273 ns |  1.063 ns |
    |   StringBuilder |   311.28 ns |  6.313 ns |  8.850 ns |
    |    RegexReplace | 1,194.91 ns | 23.376 ns | 34.265 ns |
    |       NewString |    52.26 ns |  1.122 ns |  1.812 ns |
    |NewStringForeach |    40.04 ns |  0.877 ns |  1.979 ns |
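
The benchmark above works on a fixed 14-character string, while the question is about arbitrary strings of up to a million characters. As a minimal sketch, the fastest variant in the table (the foreach loop) could be wrapped in a reusable method like the one below. The names `WhitespaceStripper` and `RemoveWhitespace` are made up for illustration, and the method has not been benchmarked here on large inputs, so measure it against your own data before relying on it.

```csharp
using System;

// Hypothetical helper (name chosen for illustration only).
public static class WhitespaceStripper
{
    // Returns a copy of input without spaces, carriage returns and line feeds.
    public static string RemoveWhitespace(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        // Worst case: nothing is removed, so the buffer needs input.Length slots.
        var buffer = new char[input.Length];
        var next = 0;

        foreach (var c in input)
        {
            switch (c)
            {
                case ' ':
                case '\r':
                case '\n':
                    // Skip the characters we want to drop.
                    break;

                default:
                    buffer[next++] = c;
                    break;
            }
        }

        // Only the first 'next' characters of the buffer were filled.
        return new string(buffer, 0, next);
    }
}
```

Calling `WhitespaceStripper.RemoveWhitespace(" ab c\r\n de.\nf")` returns `"abcde.f"`, matching string2 from the question. If tabs and other Unicode whitespace should be removed as well (which is what the `\s+` regex does), the `switch` could be replaced with a `char.IsWhiteSpace(c)` check.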