I want to write a JS script that finds non-ASCII characters in a string and replaces them with their Unicode code points (e.g. Lorem ipsum á dolor sit amet becomes Lorem ipsum [00E1] dolor sit amet). I've already written a basic program that does this, but I've noticed that it treats characters with more than 4 hex digits in their code point (e.g. 𝔸, which is U+1D538) as two separate characters:
function findNonAsciiCodes(string) {
let k = string.replace(/[^ -~]/g, x => `[${x.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')}]`);
console.log(k)
}
findNonAsciiCodes("Lorem ipsum á dolor sit amet") // with 4 digit unicode character: [00E1]
findNonAsciiCodes("Lorem ipsum 𝔸 dolor sit amet") // with 5 digit unicode character: [D835][DD38] - undesired
After some digging on StackOverflow, I've found that Java/JS cannot easily represent Unicode characters with more than 4 hex digits in their code point, and instead has to represent them with a "surrogate pair" of two 16-bit code units (https://stackoverflow.com/questions/19557026/how-to-display-5-digit-unicode-characters-such-as-a-speaker-u1f50a).
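To make the surrogate-pair representation concrete, here is a minimal sketch of how the two halves relate to the full code point. The helper name decodeSurrogatePair is hypothetical and only for illustration; the arithmetic is the standard UTF-16 decoding rule.

```javascript
// "𝔸" (U+1D538) is stored as two UTF-16 code units: a high and a low surrogate.
const s = "𝔸";
console.log(s.length);                                   // 2
console.log(s.charCodeAt(0).toString(16).toUpperCase()); // D835 (high surrogate)
console.log(s.charCodeAt(1).toString(16).toUpperCase()); // DD38 (low surrogate)

// Hypothetical helper: combine a high/low surrogate pair into one code point.
function decodeSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

console.log(decodeSurrogatePair(0xD835, 0xDD38).toString(16).toUpperCase()); // 1D538

// codePointAt(0) does the same decoding when index 0 is a high surrogate
// followed by a low surrogate.
console.log(s.codePointAt(0).toString(16).toUpperCase()); // 1D538
```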
Ideally using string.replace() and no for loops, I want to find the proper code points for these characters. However, when using string.replace(), the callback seems to receive two separate strings rather than the pair together (i.e. "𝔸".replace(/[^ -~]/g, x => console.log(x.codePointAt(0).toString(16).toUpperCase())) prints D835 and DD38 separately).
I have noticed that matching both characters of the surrogate pair at once produces the desired result (the full 5-digit code), but I don't know how to write a regex that can distinguish between surrogate pair members and neighboring non-ASCII characters. The example below matches between 1 and 2 non-ASCII characters to try to catch surrogate pairs, but a non-ASCII character with a 4-digit code can "blend" with the non-ASCII character immediately following it, including a surrogate pair member:
function findNonAsciiCodes(string) {
let k = string.replace(/[^ -~]{1,2}/g, x => `[${x.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}]`);
// regex matches between 1 and 2 non-ascii characters at once
console.log(k);
}
findNonAsciiCodes("Lorem ipsum 𝔸 dolor sit amet"); // gives correct code ([1D538])
findNonAsciiCodes("Lorem ipsum 𝔸á dolor sit amet"); // still gives correct codes ([1D538][00E1])
findNonAsciiCodes("Lorem ipsum áá dolor sit amet"); // 4 digit before 4 digit; ignores second character due to "blending" ([00E1])
// ^ desired output: Lorem ipsum [00E1][00E1] dolor sit amet
findNonAsciiCodes("Lorem ipsum á𝔸 dolor sit amet"); // 4 digit before 5 digit, gives incorrect codes due to "blending" ([00E1][DD38])
// ^ desired output: Lorem ipsum [00E1][1D538] dolor sit amet
Is it possible to write a regex capable of distinguishing between normal non-ASCII characters and surrogate pair characters, or to use JS within the .replace() callback to "decode" the surrogate pairs? If not, is it possible to achieve the desired result without for loops?
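You can use /(?![\x00-\uFFFF])./gu to match any Unicode code point outside the Basic Multilingual Plane (BMP), i.e. above U+FFFF. With the u flag the regex works on whole code points, so a surrogate pair is matched as a single character. Note that BMP characters such as á are left unchanged by this pattern: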
function findNonAsciiCodes(string) {
let k = string.replace(/(?![\x00-\uFFFF])./gu, x => `[${x.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}]`);
console.log(k);
}
findNonAsciiCodes("Lorem ipsum 𝔸 dolor sit amet"); // Lorem ipsum [1D538] dolor sit amet
findNonAsciiCodes("Lorem ipsum 𝔸á dolor sit amet"); // Lorem ipsum [1D538]á dolor sit amet
findNonAsciiCodes("Lorem ipsum áá dolor sit amet"); // Lorem ipsum áá dolor sit amet
findNonAsciiCodes("Lorem ipsum á𝔸 dolor sit amet"); // Lorem ipsum á[1D538] dolor sit amet
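Since the pattern above skips BMP characters such as á, one possible way to get the output the question asks for is to keep the original [^ -~] class and just add the u flag, so that each match is a whole code point instead of a single UTF-16 code unit. This is a minimal sketch under that assumption; the name replaceNonAscii is hypothetical, and the function returns the string instead of logging it.

```javascript
// Hypothetical variant of findNonAsciiCodes: replaces every character outside
// the printable ASCII range (space through ~) with its code point in brackets.
// The u flag makes the character class match whole code points, so a
// surrogate pair reaches the callback as one two-unit string.
function replaceNonAscii(string) {
  return string.replace(/[^ -~]/gu, ch =>
    `[${ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}]`);
}

console.log(replaceNonAscii("Lorem ipsum 𝔸 dolor sit amet"));  // Lorem ipsum [1D538] dolor sit amet
console.log(replaceNonAscii("Lorem ipsum áá dolor sit amet")); // Lorem ipsum [00E1][00E1] dolor sit amet
console.log(replaceNonAscii("Lorem ipsum á𝔸 dolor sit amet")); // Lorem ipsum [00E1][1D538] dolor sit amet
```

Like the original class, [^ -~] also matches control characters such as tabs and newlines, so those would be replaced too (a newline becomes [000A]).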