javascript, regex, language-agnostic

How to parse and capture any measurement unit


In my application, users can customize measurement units, so if they want to work in decimeters instead of inches or in full-turns instead of degrees, they can. However, I need a way to parse a string containing multiple values and units, such as 1' 2" 3/8. I've seen a few regular expressions on SO and didn't find any which matched all cases of the imperial system, let alone allowing any kind of unit. My objective is to have the most permissive input box possible.

So my question is: how can I extract multiple value-unit pairs from a string in the most user-friendly way?


I came up with the following algorithm:

  1. Check for illegal characters and throw an error if needed.
  2. Trim leading and trailing spaces.
  3. Split the string into parts every time there's a non-digit character followed by a digit character, except for .,/ which are used to identify decimals and fractions.
  4. Remove all spaces from parts, check for character misuse (multiple decimal points or fraction bars) and replace '' with ".
  5. Split value and unit-string for each part. If a part has no unit:
    • If it is the first part, use the default unit.
    • Else if it is a fraction, consider it as the same unit as the previous part.
    • Else if it isn't, consider it as in, cm or mm based on the previous part's unit.
    • If it isn't the first part and there's no way to guess the unit, throw an error.
  6. Check if units mean something, are all of the same system (metric/imperial) and follow a descending order (ft > in > fraction or m > cm > mm > fraction), throw an error if not.
  7. Convert and sum all parts, performing division in the process.
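
Steps 2 to 4 above could be sketched with plain string functions roughly like this. The `splitParts` name is hypothetical, and the lookbehind pattern is only one reading of "a non-digit character followed by a digit character, except for .,/"; validation of misused characters is left out.

```javascript
// Hypothetical helper sketching steps 2-4: trim, split into parts, normalize each part.
function splitParts(input) {
    return input
        .trim()
        // Split at every position preceded by a character that is neither a digit
        // nor one of . , / and followed by a digit.
        .split(/(?<=[^\d.,\/])(?=\d)/)
        // Remove all whitespace inside each part and turn '' into ".
        .map(part => part.replace(/\s+/g, "").replace(/''/g, '"'));
}

console.log(splitParts("  1' 2'' 3/8 "));
```

Note that under this reading a space inside a fraction (such as `1 / 2`) would also cause a split, so the rule would need refining before real use.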

I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex.


I came up with a regex:
((\d+('|''|"|m|cm|mm|\s|$) *)+(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *)?)|((\d+('|''|"|m|cm|mm|\s) *)*(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *))

It only allows fractions at the end and allows spaces between values. I've never used regex capturing, though, so I'm not sure how I'd extract the values out of this mess. I'll work on this again tomorrow.


Solution

  • Note: This answer was last revised in 2024, in an attempt to preserve most of the information found in the original while using modern JavaScript features.


    My objective is to have the most permissive input box possible.

    Careful: more permissive doesn't always mean more intuitive. An ambiguous input should warn the user, not pass silently, as that might lead them to make several mistakes before they realize their input wasn't interpreted the way they hoped.

    Looking at the pseudocode you wrote, you are trying to solve too many problems at once. Splitting up a string into small units of meaning is called tokenization, and regular expressions are really good at this. Making sense of those tokens, which is called parsing, can be done separately.

    In your situation, you could use regular expressions to first split up the input into a list of number, separator and unit tokens. You can also deal with whitespace, casing, aliases and unexpected characters here, so that you don't have to think about strings when you write the logic that deals with numbers and units.
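
As a rough illustration of keeping parsing separate from tokenizing, the sketch below assumes a tokenizer has already produced a flat list of tokens. The `{ type, value }` token shape, the `parseLength` name and the `MM_PER_UNIT` table are all hypothetical and chosen only for this example; fractions and decimals are left out.

```javascript
// Hypothetical conversion table: millimetres per unit.
const MM_PER_UNIT = new Map([["mm", 1], ["cm", 10], ["m", 1000], ["in", 25.4], ["ft", 304.8]]);

// Hypothetical parser: expects alternating number and unit tokens,
// e.g. [{ type: "number", value: "1" }, { type: "unit", value: "m" }].
function parseLength(tokens) {
    let totalMm = 0;
    let pendingNumber = null; // a number waiting for its unit

    for (const token of tokens) {
        if (token.type === "number") {
            if (pendingNumber !== null)
                throw new Error("Missing unit after " + pendingNumber);
            pendingNumber = Number(token.value);
        } else if (token.type === "unit") {
            if (pendingNumber === null)
                throw new Error("Unit without a value: " + token.value);
            if (!MM_PER_UNIT.has(token.value))
                throw new Error("Unknown unit: " + token.value);
            totalMm += pendingNumber * MM_PER_UNIT.get(token.value);
            pendingNumber = null;
        } else {
            throw new Error("Unexpected token type: " + token.type);
        }
    }

    if (pendingNumber !== null)
        throw new Error("Missing unit after " + pendingNumber);
    return totalMm;
}

console.log(parseLength([
    { type: "number", value: "1" }, { type: "unit", value: "m" },
    { type: "number", value: "5" }, { type: "unit", value: "cm" },
])); // 1050
```

Because the parser never touches the raw string, rules such as unit ordering or default units could be added here without changing any regular expression.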

    How can I extract multiple value-unit pairs from a string using a regex?

    Parentheses in a regular expression create capture groups, which can be retrieved from a match by some but not all JavaScript functions involving RegExp. In modern JavaScript, you can use string.matchAll(regex) with a regular expression that has the g flag to obtain an iterator of all the matches found in the string, including their capture groups.

    Each match will be a special array with additional properties. Item 0 is the entire match and items 1, 2 and onwards are the capture groups. index is the starting index of the match in the input string.

    For example, iterating over
    "1 hour 30 minutes".matchAll(/(\d+) ([a-z]+)/g) gives:

    • ["1 hour", "1", "hour"] with index: 0
    • ["30 minutes", "30", "minutes"] with index: 7
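
A small runnable version of this example might look as follows; the variable names are arbitrary.

```javascript
const input = "1 hour 30 minutes";
const pairRx = /(\d+) ([a-z]+)/g; // matchAll requires the g flag

// Spread the iterator into an array and pick out the interesting parts of each match.
const matches = [...input.matchAll(pairRx)].map(match => ({
    whole: match[0],   // the entire match
    value: match[1],   // first capture group
    unit: match[2],    // second capture group
    index: match.index // where the match starts in the input
}));

console.log(matches);
```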

    You've been using parentheses to isolate OR clauses such as a|b and to apply quantifiers to a character sequence such as (abc)+. If you want to do that without creating capture groups, you can use (?: ) instead. This is called a non-capturing group. It does the same thing as regular parentheses in a regex, but what's inside it won't create an entry in the returned array.
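
To see the difference, compare the arrays returned for the same input with and without a capture group:

```javascript
// Regular parentheses: the group's last repetition is stored at index 1.
const capturing = "abcabc".match(/(abc)+/);
// Non-capturing group: same matching behavior, but no extra entry.
const nonCapturing = "abcabc".match(/(?:abc)+/);

console.log(capturing.length);    // 2
console.log(nonCapturing.length); // 1
```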

    A more recent syntax for regular expressions lets you use (?<name>expression) to give a capture group a name. Named groups are then also exposed as enumerable properties of match.groups. Reusing our example, iterating over
    "1 hour 30 minutes".matchAll(/(?<value>\d+) (?<unit>[a-z]+)/g) gives:

    • ["1 hour", "1", "hour"] with groups: { value: "1", unit: "hour" }
    • ["30 minutes", "30", "minutes"] with groups: { value: "30", unit: "minutes" }
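
Named groups also combine nicely with destructuring. A minimal sketch, with arbitrary variable names:

```javascript
const durationRx = /(?<value>\d+) (?<unit>[a-z]+)/g;
const pairs = [];

// Destructure the groups object of each match directly in the loop header.
for (const { groups } of "1 hour 30 minutes".matchAll(durationRx)) {
    pairs.push({ value: Number(groups.value), unit: groups.unit });
}

console.log(pairs);
```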

    Okay, so how do we use regular expressions to make a tokenizer?

    Take a look at this regular expression:

    const tokenRx = /(?<whitespace>\s+)|(?<decimalSeparator>[.,])|(?<fractionSeparator>\/)|(?<integer>\d+)|(?<unit>km|cm|mm|m|ft|in|'|")/gi;
    

    Woah, that's hard to read. Here's the pattern inside each named capture group:

    name               pattern
    whitespace         \s+
    decimalSeparator   [.,]
    fractionSeparator  \/
    integer            \d+
    unit               km|cm|mm|m|ft|in|'|"

    This regular expression makes clever use of capture groups separated by OR clauses. For each match, only one capture group will contain anything. If we look at the groups object of each match, all of its properties will be undefined except one. For example, the string "10 ft" would produce three matches, with groups objects like these:

    • { integer: "10", ... }
    • { whitespace: " ", ... }
    • { unit: "ft", ... }
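
This can be checked directly by counting how many groups are defined in each match. The sketch below reuses the regular expression from above; `tokenNames` and `definedCounts` are names made up for the demonstration.

```javascript
const tokenRx = /(?<whitespace>\s+)|(?<decimalSeparator>[.,])|(?<fractionSeparator>\/)|(?<integer>\d+)|(?<unit>km|cm|mm|m|ft|in|'|")/gi;

const allMatches = [..."10 ft".matchAll(tokenRx)];

// For each match, the name of the group that is not undefined.
const tokenNames = allMatches.map(match =>
    Object.keys(match.groups).find(name => match.groups[name] !== undefined));

// For each match, how many groups are defined (expected to be exactly one).
const definedCounts = allMatches.map(match =>
    Object.values(match.groups).filter(value => value !== undefined).length);

console.log(tokenNames);    // ["integer", "whitespace", "unit"]
console.log(definedCounts); // [1, 1, 1]
```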

    Using this, I built a function that takes a string and returns an iterator over its tokens. Here is a live demo so you can experiment with it. The code isn't pretty but I wanted to keep it short as this post is big enough already.

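    // One named capture group per token type; exactly one group is defined per match.
    const tokenRx = /(?<whitespace>\s+)|(?<decimalSeparator>[.,])|(?<fractionSeparator>\/)|(?<integer>\d+)|(?<unit>km|cm|mm|m|ft|in|'|")/gi;

    // Yields [tokenName, text] pairs; throws on characters that no token pattern covers.
    function* tokenize (input) {
        let cursor = 0;

        for (const match of input.matchAll(tokenRx)) {
            // A gap between the previous match and this one means unmatched characters.
            if (cursor !== match.index)
                throw new Error("Unexpected characters: " + input.slice(cursor, match.index));
            cursor = match.index + match[0].length;
            // Pick the single group that participated in this match.
            yield Object.entries(match.groups).find(([name, value]) => value !== undefined);
        }

        // Anything left after the last match is unexpected too.
        if (cursor !== input.length)
            throw new Error("Unexpected characters: " + input.slice(cursor));
    }

    // Demo wiring: tokenize the input field's value when the button is clicked.
    const $ = s => document.querySelector(s);
    $("button").onclick = function handler (event) {
        try {
            $("output").textContent = JSON.stringify([...tokenize($("input").value)], undefined, 2);
        } catch (e) {
            $("output").textContent = e.toString();
        }
    };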
    <label>Input to tokenize <input value="1&apos; 2&quot; 3/8"></label>
    <button>Tokenize</button>
    <pre><output></output></pre>

    This is way too generic and complex a solution if all you're trying to do is accept imperial and metric lengths. For that, I'd probably just write a separate regular expression for each acceptable format, then test the user's input to see which ones match. If two different expressions match, then the input is ambiguous and we should warn the user. But hopefully this serves as a solid example of how string.matchAll(regexp) and named capture groups can be used.
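
The one-expression-per-format idea could be sketched like this. The three formats, their names and the `detectFormat` function are hypothetical and only meant to show how ambiguity can be detected; a real application would define its own list.

```javascript
// Hypothetical accepted formats, one anchored regular expression each.
const formats = {
    feetAndInches: /^\d+\s*(?:ft|')\s*\d+\s*(?:in|")$/i,
    decimalComma: /^\d+,\d+\s*(?:mm|cm|m)$/i,             // e.g. "1,5 m"
    thousandsComma: /^\d{1,3}(?:,\d{3})+\s*(?:mm|cm|m)$/i // e.g. "12,500 mm"
};

// Returns the names of all formats that accept the input.
function matchingFormats(input) {
    const trimmed = input.trim();
    return Object.keys(formats).filter(name => formats[name].test(trimmed));
}

// Returns the single matching format name, or throws if none or several match.
function detectFormat(input) {
    const names = matchingFormats(input);
    if (names.length === 0)
        throw new Error("Unrecognized input: " + input);
    if (names.length > 1)
        throw new Error("Ambiguous input, could be: " + names.join(", "));
    return names[0];
}

console.log(detectFormat("1' 2\"")); // "feetAndInches"
console.log(detectFormat("1,5 m"));  // "decimalComma"
// detectFormat("1,500 m") throws: it fits both comma formats.
```

Here "1,500 m" could mean one and a half metres or fifteen hundred metres, which is exactly the kind of input that should trigger a warning instead of a silent guess.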