javascript, regex, xregexp

JavaScript regular expression to replace special characters, but allow a whitelist, using XRegExp


I want to remove most special characters from a string in JavaScript, but keep some special cases such as c++ and c#. I have experimented with the XRegExp library in Node.js, and I think I am able to remove everything that is not a letter, a number, or whitespace. I would also like to keep letters from other languages. This is what I have so far:

  var str = "I do programming in c++ and sometimes c#, but + and # should be removed";
  var regex = XRegExp('[^\\s\\p{N}\\p{L}]+', 'g');
  var replaced = XRegExp.replace(str, regex, "");
  console.log(replaced); 

This outputs

I do programming in c and sometimes c but  and  should be removed

I need to create some kind of list with allowed words, like c++ and c#. Desired output is:

I do programming in c++ and sometimes c#, but and should be removed

Solution

  • You can just use alternations inside a capturing group and then restore this text with a backreference in the replacement pattern:

    var str = "I do programming in c++ and sometimes c#, but + and # should be removed";
    var regex = XRegExp('(\\b(?:c[+]{2}|c#)(?!\\w))|[^\\s\\p{N}\\p{L}]+', 'ig');
    //                   ^-- capture group 1 -----^                        ^
    var replaced = XRegExp.replace(str, regex, "$1");
    //                                          ^^
    console.log(replaced);
    <script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/2.0.0/xregexp-all-min.js"></script>

    Note that I added the i flag to make the pattern case-insensitive, a \b at the start of the alternation so that a match can only begin at a word boundary (c++ and c# both start with a letter, which is a word character), and the lookahead (?!\w) to make sure no word character follows the + or #. A trailing \b would not work here: + and # are not word characters, so a \b after them would require a word character to follow, which is the opposite of what is needed.
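
    Since the question asks for "some kind of list with allowed words", the same pattern could be built from an array instead of being written by hand. The following is only a sketch under a few assumptions: makeCleaner and escapeRegExp are hypothetical helper names, not part of XRegExp, and the sketch swaps in the native RegExp u flag (which supports \p{L} and \p{N} in current Node.js and browsers) so that it runs without the library. It also assumes every allowed term starts with a word character, because of the leading \b.

```javascript
// Hypothetical helper: escape regex metacharacters so a term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a cleaning function from a list of allowed terms.
// Assumption: every term starts with a word character (the leading \b needs it).
function makeCleaner(allowedTerms) {
  var alternation = allowedTerms
    .slice()
    // Longest first, so a longer term is tried before a shorter one with the same prefix.
    .sort(function (a, b) { return b.length - a.length; })
    .map(escapeRegExp)
    .join('|');

  // Group 1 captures an allowed term; the second alternative matches
  // any run of characters that are not whitespace, numbers, or letters.
  var pattern = new RegExp(
    '(\\b(?:' + alternation + ')(?!\\w))|[^\\s\\p{N}\\p{L}]+',
    'giu'
  );

  return function (input) {
    // "$1" restores an allowed term; for every other match group 1 is
    // unset, so the match is replaced with an empty string.
    return input.replace(pattern, '$1');
  };
}

var clean = makeCleaner(['c++', 'c#']);
console.log(clean('I do programming in c++ and sometimes c#, but + and # should be removed'));
// prints: I do programming in c++ and sometimes c# but  and  should be removed
```

    The comma is removed as well, because it is neither a letter, a number, nor whitespace, and the spaces on either side of the removed + and # are left in place. Adding another allowed term means adding a string to the array, with no change to the pattern itself.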