Tags: javascript, node.js, regex, ascii, non-ascii-characters

Perform multiple regex filters on text content in Node.js with JavaScript


I have multiple regex filters I want to run on a .txt file within Node. I read the file and store its contents in a variable, and I then want to run the contents through the regexes to remove any illegal characters.

I originally attempted to use one of the only Node modules I found that could do this, https://www.npmjs.com/package/clean-text-utils. However, it seems to be aimed at TypeScript and I couldn't get it to work with Node 8.10, so I dug into the node_module to find the relevant JS and tried to replace the illegal characters using its function.

How can I run all the regex filters on the myTXT variable? At the moment, it just outputs the text with the incorrect non-ASCII apostrophes.

var myTXT;

...

const readFile = util.promisify(fs.readFile);
await readFile('/tmp/' + myfile, 'utf8')
    .then((text) => {
        console.log('Output contents: ', text);
        myTXT = text;
    })
    .catch((err) => {
        console.log('Error', err);
    });

var myTXT = function (myTXT) {
    var s = text
        .replace(/[‘’\u2018\u2019\u201A]/g, '\'')
        .replace(/[“”\u201C\u201D\u201E]/g, '"')
        .replace(/\u2026/g, '...')
        .replace(/[\u2013\u2014]/g, '-');
    return s.trim();
};

console.log('ReplaceSmartChars is', myTXT);

Here is an example of the issues with apostrophes caused by copying text from a web page and pasting into a .txt file, also shown in PasteBin:

Resilience is what happens when we’re able to move forward even when things don’t fit together the way we expect.

And tolerances are an engineer’s measurement of how well the parts meet spec. (The word ‘precision’ comes to mind). A 2018 Lexus is better than 1968 Camaro because every single part in the car fits together dramatically better. The tolerances are more narrow now.

https://pastebin.com/uJ7GAKk4

Copied from the following URL, pasted into Notepad and saved:

https://seths.blog/storyoftheweek/


Solution

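  • At the moment you never call the function that performs the replacement. Because the function is assigned to the same myTXT variable that holds the file contents, one simply overwrites the other. Give the function its own name and call it on the text that was read: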

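    const fs = require('fs');
    const util = require('util');

    const readFile = util.promisify(fs.readFile);

    // Replace smart quotes, ellipses and dashes with ASCII equivalents.
    function replaceChars(text) {
        return text
            .replace(/[‘’\u2018\u2019\u201A]/g, '\'')
            .replace(/[“”\u201C\u201D\u201E]/g, '"')
            .replace(/\u2026/g, '...')
            .replace(/[\u2013\u2014]/g, '-')
            .trim();
    }

    // This must run inside an async function, since it uses await.
    // The value returned from .then() is what ends up in myTXT;
    // if the read fails, .catch() returns nothing and myTXT is undefined.
    const myTXT = await readFile('/tmp/' + myfile, 'utf8')
        .then((text) => {
            console.log('Output contents: ', text);
            return replaceChars(text);
        })
        .catch((err) => {
            console.log('Error', err);
        });

    console.log('ReplaceSmartChars is', myTXT);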
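
If the list of filters is likely to grow, one possible variation is to keep the patterns in a table and apply them in a loop instead of chaining .replace() calls by hand. The sketch below is only an illustration: the names RULES and applyRules are made up for this example and are not part of clean-text-utils, and the file reading is left out so the snippet runs on its own.

```javascript
// Hypothetical rule table: each entry is [pattern, replacement].
const RULES = [
    [/[\u2018\u2019\u201A]/g, "'"],  // single curly quotes and low-9 quote
    [/[\u201C\u201D\u201E]/g, '"'],  // double curly quotes and low-9 double quote
    [/\u2026/g, '...'],              // horizontal ellipsis
    [/[\u2013\u2014]/g, '-'],        // en dash and em dash
];

// Apply every rule in order, feeding each result into the next replace,
// then trim surrounding whitespace.
function applyRules(text, rules = RULES) {
    return rules
        .reduce((acc, [pattern, replacement]) => acc.replace(pattern, replacement), text)
        .trim();
}

const sample = 'We\u2019re \u201Cdone\u201D \u2013 almost\u2026 ';
console.log(applyRules(sample)); // prints: We're "done" - almost...
```

In the code above, applyRules(text) could be returned from the .then() callback in place of replaceChars(text). Adding a filter then only means adding a row to the table.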