In C# I have a string containing whitespace, carriage returns and/or line breaks. Is there a simple way to normalize large strings (100,000 to 1,000,000 characters) imported from text files as efficiently as possible?
To clarify what I mean: let's say my string looks like string1, but I want it to end up like string2:
string1 = " ab c\r\n de.\nf";
string2 = "abcde.f";
How "efficient" each approach is can depend heavily on your actual strings and how many of them you have. I've put together the following benchmark (for BenchmarkDotNet):
using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

public class Replace
{
    private static readonly string S = " ab c\r\n de.\nf";
    private static readonly Regex Reg = new Regex(@"\s+", RegexOptions.Compiled);

    [Benchmark]
    public string SimpleReplace() => S
        .Replace(" ", "")
        .Replace("\r", "")
        .Replace("\n", "");

    [Benchmark]
    public string StringBuilder() => new StringBuilder().Append(S)
        .Replace(" ", "")
        .Replace("\r", "")
        .Replace("\n", "")
        .ToString();

    [Benchmark]
    public string RegexReplace() => Reg.Replace(S, "");

    [Benchmark]
    public string NewString()
    {
        var arr = new char[S.Length];
        var cnt = 0;
        for (int i = 0; i < S.Length; i++)
        {
            switch (S[i])
            {
                case ' ':
                case '\r':
                case '\n':
                    break;
                default:
                    arr[cnt] = S[i];
                    cnt++;
                    break;
            }
        }
        return new string(arr, 0, cnt);
    }

    [Benchmark]
    public string NewStringForeach()
    {
        var validCharacters = new char[S.Length];
        var next = 0;
        foreach (var c in S)
        {
            switch (c)
            {
                case ' ':
                case '\r':
                case '\n':
                    // Ignore them
                    break;
                default:
                    validCharacters[next++] = c;
                    break;
            }
        }
        return new string(validCharacters, 0, next);
    }
}
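To reproduce the numbers, a minimal entry point for BenchmarkDotNet (assuming the NuGet package is installed and the project is built in Release mode) could look like this:

```csharp
using BenchmarkDotNet.Running;

public class Program
{
    // Runs every [Benchmark]-annotated method in the Replace class
    // and prints the summary table to the console.
    public static void Main() => BenchmarkRunner.Run<Replace>();
}
```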
This gives on my machine:
| Method           |        Mean |     Error |    StdDev |
|----------------- |------------:|----------:|----------:|
| SimpleReplace    |   122.09 ns |  1.273 ns |  1.063 ns |
| StringBuilder    |   311.28 ns |  6.313 ns |  8.850 ns |
| RegexReplace     | 1,194.91 ns | 23.376 ns | 34.265 ns |
| NewString        |    52.26 ns |  1.122 ns |  1.812 ns |
| NewStringForeach |    40.04 ns |  0.877 ns |  1.979 ns |
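For real use, the fastest variant can be wrapped in a helper. The sketch below is a hypothetical `RemoveWhitespace` method based on the `NewStringForeach` approach, but generalized with `char.IsWhiteSpace` so tabs and other Unicode whitespace are also stripped (a slightly broader behavior than the benchmark, which only removes spaces, `\r`, and `\n`):

```csharp
using System;

public static class StringNormalizer
{
    // Copies every non-whitespace character into a buffer,
    // then builds the result string in a single allocation.
    public static string RemoveWhitespace(string input)
    {
        var buffer = new char[input.Length];
        var next = 0;
        foreach (var c in input)
        {
            if (!char.IsWhiteSpace(c))
                buffer[next++] = c;
        }
        return new string(buffer, 0, next);
    }
}
```

Usage: `StringNormalizer.RemoveWhitespace(" ab c\r\n de.\nf")` yields `"abcde.f"`, matching the string2 from the question.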