php number-formatting string-parsing

How to get NumberFormatter::parse() to only parse actual numeric strings?


I’m trying to parse some strings in some messed-up CSV files (about 100,000 rows per file). Some columns have been squished together in some rows, and I’m trying to get them unsquished back into their proper columns. Part of the logic needed for that is determining whether a substring in a given column is numeric or not.

Non-numeric strings can be anything, including strings that happen to begin with a number; numeric strings are generally written the European way, with dots used for thousand separators and commas for decimals, so without going through a bunch of string replacements, is_numeric() won’t do the trick:

\var_dump(is_numeric('3.527,25')); // bool(FALSE)

I thought – naïvely, it transpires – that the right thing to do would be to use NumberFormatter::parse(), but it seems that function doesn’t actually check whether the string given as a whole is parseable as a numeric string at all – instead it just starts at the beginning and when it reaches a character not allowed in a numeric string, cuts off the rest.

Essentially, what I’m looking for is something that will yield this:

$formatter = new \NumberFormatter('de-DE', \NumberFormatter::DECIMAL);
\var_dump($formatter->parse('3.527,25')); // float(3527.25)
\var_dump($formatter->parse('3thisisnotanumber')); // bool(FALSE)

But all I can get is this:

$formatter = new \NumberFormatter('de-DE', \NumberFormatter::DECIMAL);
\var_dump($formatter->parse('3.527,25')); // float(3527.25)
\var_dump($formatter->parse('3thisisnotanumber')); // float(3)

I figured perhaps the problem was that the LENIENT_PARSE attribute was set to true, but setting it to false ($formatter->setAttribute(\NumberFormatter::LENIENT_PARSE, 0)) has no effect; non-numeric strings still get parsed just fine as long as they begin with a number.

Since there are so many rows and each row may have as many as ten columns that need to be validated, I’m looking at upwards of a million validations per file – for that reason, I would prefer avoiding a preg_match()-based solution, since a million regex match calls would be quite expensive.

Is there some way to tell the NumberFormatter class that you would like it to please not be lenient and only treat the string as parseable if the entire string is numeric?


Solution

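  • You can strip all the separators and check whether what remains is a numeric value. Note that this does not validate where the separators sit (a string like '1,2,3' passes), and is_numeric() also accepts forms such as '1e5'.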

    function customIsNumeric(string $value): bool
    {
        return is_numeric(str_replace(['.', ','], '', $value));
    }
    

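If you would rather stay with NumberFormatter, one possible approach is to use the optional third argument of NumberFormatter::parse(), which is passed by reference and receives the position at which parsing stopped. Comparing that position with the string length tells you whether the whole string was consumed. The following is only a minimal sketch: the helper name parseWholeNumber() is hypothetical, it requires the intl extension, and it assumes ASCII input, since the reported offset is compared against the byte length from strlen().

```php
<?php

/**
 * Hypothetical helper (not part of the original answer): returns the parsed
 * float only if the formatter consumed the entire string, otherwise null.
 * Assumes ASCII input, because the offset is compared with strlen().
 */
function parseWholeNumber(\NumberFormatter $formatter, string $value): ?float
{
    $offset = 0; // parse() updates this to the position where parsing stopped
    $result = $formatter->parse($value, \NumberFormatter::TYPE_DOUBLE, $offset);

    // Reject outright failures and partial parses such as '3thisisnotanumber'.
    if ($result === false || $offset !== strlen($value)) {
        return null;
    }

    return (float) $result;
}

$formatter = new \NumberFormatter('de-DE', \NumberFormatter::DECIMAL);
var_dump(parseWholeNumber($formatter, '3.527,25'));          // float(3527.25)
var_dump(parseWholeNumber($formatter, '3thisisnotanumber')); // NULL
```

Because the check is a plain integer comparison after the parse call, no regex is involved. Whether it is fast enough for around a million validations per file is something you would need to measure on your own data.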