c# string unicode ascii

How do I Escape Foreign Characters using Regex or StringBuilder?


I have the following method to clean up strings:

public static String UseStringBuilderWithHashSet(string strIn)
{
    var hashSet = new HashSet<char>("?&^$#@!()+-,:;<>’\'-_*");
    // specify capacity of StringBuilder to avoid resizing
    StringBuilder sb = new StringBuilder(strIn.Length);
    foreach (char x in strIn.Where(c => !hashSet.Contains(c)))
    {
        sb.Append(x);
    }
    return sb.ToString();
}

However, strings such as [MV] REOL ちるちる ChiruChiru or [MV] REOL ヒビカセ Hibikase do not get cleaned up.

How can I modify my method so that it turns one of the above strings into, for example, [MV] REOL ChiruChiru?


Solution

  • You're trying to solve this exhaustively by filtering out everything you don't want. That doesn't scale, as there are 100,000+ possible characters.

    You may find better results if you only accept what you do want.

    // Needs: using System.Text; using System.Text.RegularExpressions;
    public static string CleanInput(string input)
    {
        // a-zA-Z allows any English alphabet character, upper or lower case
        // \[ and \] allow square brackets
        // \s allows whitespace
        var regex = new Regex(@"[a-zA-Z\[\]\s]");
        var stringBuilder = new StringBuilder(input.Length);
        // keep only the characters that match the allowed set
        foreach (char c in input)
        {
            if (regex.IsMatch(c.ToString()))
            {
                stringBuilder.Append(c);
            }
        }
        string output = stringBuilder.ToString();
        // \s+ matches any run of whitespace (such as the gap left behind
        // by removed characters) and replaces it with a single space.
        return Regex.Replace(output, @"\s+", " ");
    }
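
    If calling the regex once per character bothers you, the same allow-list idea can be written in a single pass. What follows is only a sketch under a few assumptions: the class name TextCleaner and the method names CleanWithRegex and CleanWithStringBuilder are made up for illustration, the allowed set is the same as above (ASCII letters, square brackets, whitespace), and the final Trim() is an extra that the method above does not do.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative, not from the original answer.
public static class TextCleaner
{
    // Matches anything that is NOT an ASCII letter, a square bracket, or whitespace.
    private static readonly Regex Disallowed =
        new Regex(@"[^a-zA-Z\[\]\s]", RegexOptions.Compiled);

    // Matches one or more consecutive whitespace characters.
    private static readonly Regex WhitespaceRun =
        new Regex(@"\s+", RegexOptions.Compiled);

    // Regex variant: remove every disallowed character in one Replace call,
    // then collapse each whitespace run into a single space.
    public static string CleanWithRegex(string input)
    {
        string kept = Disallowed.Replace(input, string.Empty);
        // Trim() is an addition (assumption): it drops a leading or trailing space.
        return WhitespaceRun.Replace(kept, " ").Trim();
    }

    // StringBuilder variant: no regex, one loop over the input.
    public static string CleanWithStringBuilder(string input)
    {
        var sb = new StringBuilder(input.Length);
        bool lastWasSpace = false;
        foreach (char c in input)
        {
            if (char.IsWhiteSpace(c))
            {
                // Emit at most one space per run of whitespace.
                if (!lastWasSpace)
                {
                    sb.Append(' ');
                }
                lastWasSpace = true;
            }
            else if (IsAsciiLetter(c) || c == '[' || c == ']')
            {
                sb.Append(c);
                lastWasSpace = false;
            }
            // Any other character is skipped. lastWasSpace is left unchanged,
            // so the spaces on either side of a removed word collapse into one.
        }
        return sb.ToString().Trim();
    }

    private static bool IsAsciiLetter(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
```

    With the strings from the question, both variants should turn [MV] REOL ちるちる ChiruChiru into [MV] REOL ChiruChiru. If you later need digits or some punctuation to survive, add them to the allowed set in both places.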