javascript, regex, string, unicode

Replace non-ASCII characters with their Unicode code points when they have a 5+ digit Unicode code


I want to create a JS script that can identify non-ASCII characters in a string and replace them with their corresponding Unicode code points (i.e. "Lorem ipsum á dolor sit amet" becomes "Lorem ipsum [00E1] dolor sit amet"). I've already created a basic program for doing so, but I've noticed that it interprets characters with 5 or more hex digits in their code point (e.g. 𝔸, which is Unicode U+1D538) as two separate characters:

function findNonAsciiCodes(string) {
  let k = string.replace(/[^ -~]/g, x => `[${x.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')}]`);
  console.log(k)
}
findNonAsciiCodes("Lorem ipsum á dolor sit amet") // with 4 digit unicode character: [00E1]
findNonAsciiCodes("Lorem ipsum 𝔸 dolor sit amet") // with 5 digit unicode character: [D835][DD38] - undesired

After some digging on StackOverflow, I've found that Java/JS cannot easily represent Unicode characters with more than 4 hex digits in their code point as a single character. Because strings are stored as UTF-16, such characters are instead represented with a "surrogate pair" of 2 code units (https://stackoverflow.com/questions/19557026/how-to-display-5-digit-unicode-characters-such-as-a-speaker-u1f50a).
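For illustration, here is a minimal sketch of how a surrogate pair relates to its code point. `decodeSurrogatePair` is a hypothetical helper written only for this example; the arithmetic is the standard UTF-16 decoding formula, and the built-in `codePointAt` gives the same result.

```javascript
// Hypothetical helper: combine a high and a low surrogate (UTF-16 code units)
// into the single code point they encode.
function decodeSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const s = "\u{1D538}";        // 𝔸
const high = s.charCodeAt(0); // 0xD835, the high surrogate
const low = s.charCodeAt(1);  // 0xDD38, the low surrogate

console.log(s.length); // 2 (two UTF-16 code units)
console.log(decodeSurrogatePair(high, low).toString(16).toUpperCase()); // 1D538
console.log(s.codePointAt(0).toString(16).toUpperCase());               // 1D538
```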

I want to find a way to, ideally using string.replace() and no for loops, find the proper code points for these characters; however, when using string.replace(), it seems to produce two separate strings rather than creating an array (i.e. "𝔸".replace(/[^ -~]/g, x => console.log(x.codePointAt(0).toString(16).toUpperCase())) prints D835 and DD38 separately).

I have noticed that matching both characters of the surrogate pair at once produces the desired result (the full 5-digit code), but I don't know how to write a regex that can distinguish between surrogate pair members and neighboring non-ASCII characters. The example below matches between 1 and 2 non-ASCII characters to try to catch surrogate pairs, but a non-ASCII character with a 4-digit code can "blend" with the non-ASCII character immediately following it, including a member of a surrogate pair:

function findNonAsciiCodes(string) {
  let k = string.replace(/[^ -~]{1,2}/g, x => `[${x.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}]`);
  // regex matches between 1 and 2 non-ascii characters at once
  console.log(k);
}
findNonAsciiCodes("Lorem ipsum 𝔸 dolor sit amet"); // gives correct code ([1D538])
findNonAsciiCodes("Lorem ipsum 𝔸á dolor sit amet"); // still gives correct codes ([1D538][00E1])
findNonAsciiCodes("Lorem ipsum áá dolor sit amet"); // 4 digit before 4 digit; ignores second character due to "blending" ([00E1])
// ^ desired output: Lorem ipsum [00E1][00E1] dolor sit amet
findNonAsciiCodes("Lorem ipsum á𝔸 dolor sit amet"); // 4 digit before 5 digit, gives incorrect codes due to "blending" ([00E1][DD38])
// ^ desired output: Lorem ipsum [00E1][1D538] dolor sit amet

Is it possible to write a regular expression capable of distinguishing between normal non-ASCII characters and surrogate pair characters, or to use JS inside the .replace() callback to "decode" the surrogate pairs? If not, is it possible to achieve the desired result without for loops?


Solution

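  • You can use /(?![\x00-\uFFFF])./gu to match any Unicode code point outside the Basic Multilingual Plane (BMP), i.e. above U+FFFF. The u flag makes the regex work on whole code points, so a surrogate pair is matched as a single character. Note that this pattern leaves BMP characters such as á untouched: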

    function findNonAsciiCodes(string) {
      let k = string.replace(/(?![\x00-\uFFFF])./gu, x => `[${x.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}]`);
      console.log(k);
    }
    findNonAsciiCodes("Lorem ipsum 𝔸 dolor sit amet"); // Lorem ipsum [1D538] dolor sit amet
    findNonAsciiCodes("Lorem ipsum 𝔸á dolor sit amet"); // Lorem ipsum [1D538]á dolor sit amet
    findNonAsciiCodes("Lorem ipsum áá dolor sit amet"); // Lorem ipsum áá dolor sit amet
    findNonAsciiCodes("Lorem ipsum á𝔸 dolor sit amet"); // Lorem ipsum á[1D538] dolor sit amet
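
If the goal is to replace every non-ASCII character, as in the question's desired output, rather than only those outside the BMP, one option is to keep the original [^ -~] class and simply add the u flag. With u, the negated class consumes a whole code point, so a surrogate pair is passed to the callback as one match. A minimal sketch, assuming precomposed characters such as U+00E1; `replaceNonAscii` is a hypothetical name, and it returns the string instead of logging it:

```javascript
// Sketch: replace every character outside the printable ASCII range
// (space through tilde) with its code point in brackets.
// The u flag makes the regex match by code point, not by UTF-16 code unit.
function replaceNonAscii(string) {
  return string.replace(/[^ -~]/gu, ch =>
    `[${ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}]`);
}

console.log(replaceNonAscii("Lorem ipsum áá dolor sit amet")); // Lorem ipsum [00E1][00E1] dolor sit amet
console.log(replaceNonAscii("Lorem ipsum á𝔸 dolor sit amet")); // Lorem ipsum [00E1][1D538] dolor sit amet
```

Like the original pattern, [^ -~] also matches ASCII control characters such as tabs and newlines, so those would be replaced too.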