Tags: javascript, regex, web-scraping, puppeteer, etl

Sanitizing strings with several types of quotation marks


I'm working on a side project that aggregates data from various websites, sanitizes the input data, then stores it in Postgres.

Currently I have to implement my own solutions for sanitizing dirty/ugly data, which hasn't been too bad, but I've run into an issue with height measurements that use a mixed bag of quote types, e.g. 5’4″, 5′ 9″

I'd like to sanitize the strings as follows:

  • ’, ′ and similar characters are replaced with a single quote (') for feet.
  • ″ and similar characters are replaced with a double quote (") for inches.

Is there a library which solves this problem?
If not, is there a concise regex that provides the same result?


Solution

  • One option is a regex replacement with a lookup table: match any of the quote variants with a character class and map each match to its ASCII equivalent:

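    // Lookup table: typographic mark -> ASCII quote
    const map = {
      "’": "'",   // right single quotation mark (U+2019)
      "′": "'",   // prime (U+2032)
      "″": "\"",  // double prime (U+2033)
    };

    const input = "5’4″ and 5′ 9″";
    // Replace every matched character with its entry in the lookup table
    const output = input.replace(/[’′″]/g, (x) => map[x]);
    console.log(input + " => " + output);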
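
If the scraped data contains more variants than those three, the same lookup idea can be extended to cover the "similar characters" from the question. The following is only a minimal sketch: `normalizeHeightQuotes`, `FEET_MARKS` and `INCH_MARKS` are hypothetical names, and the characters listed are an assumption about what might show up in your data, not an exhaustive list. It also assumes that two consecutive single quotes are meant as an inch mark, which may not hold for every source.

```javascript
// Assumed set of marks used for feet; extend as needed for your data.
// U+2018 ‘, U+2019 ’, U+201B ‛, U+2032 ′, U+02B9 ʹ, U+00B4 ´, and the backtick
const FEET_MARKS = "\u2018\u2019\u201B\u2032\u02B9\u00B4`";

// Assumed set of marks used for inches.
// U+201C “, U+201D ”, U+201F ‟, U+2033 ″, U+02BA ʺ
const INCH_MARKS = "\u201C\u201D\u201F\u2033\u02BA";

// Build the lookup table from the two lists instead of writing each entry by hand.
const QUOTE_MAP = new Map([
  ...[...FEET_MARKS].map((ch) => [ch, "'"]),
  ...[...INCH_MARKS].map((ch) => [ch, '"']),
]);

// None of the listed characters is special inside a character class,
// so they can be dropped into the pattern without escaping.
const QUOTE_PATTERN = new RegExp(`[${FEET_MARKS}${INCH_MARKS}]`, "g");

function normalizeHeightQuotes(input) {
  return input
    // Map every known variant to its ASCII quote.
    .replace(QUOTE_PATTERN, (ch) => QUOTE_MAP.get(ch))
    // Assumption: two single quotes in a row were typed in place of a double quote.
    .replace(/''/g, '"');
}

console.log(normalizeHeightQuotes("5’4″ and 5′ 9″")); // 5'4" and 5' 9"
```

Keeping the two character lists separate from the regex means a newly discovered variant only has to be added in one place. Strings that already use plain ASCII quotes pass through unchanged.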