php web-scraping localization numeric separator

How to get the current decimal/thousand separator for an unknown number


I need to insert numbers scraped from various sites into a database, but I don't know in advance which decimal separator and thousands separator each site uses.

Input numbers that express the value 1000 could be:

1,000.00
1,000.000
1.000,00
1.000,000

Input numbers that express the value 1 could be:

1.00
1.000
1,00
1,000

Of course, both the integer part and the number of decimal places can vary.


Solution

  • Going by the examples in your question, the last dot or comma in the string is always the decimal separator. Split on that last symbol to get the integer part, the separator, and the decimal part, then process or format those parts as needed.

    Code:

    $tests = [
        '1,000.00',
        '1,000.000',
        '1.000,00',
        '1.000,000',
        '1.00',
        '1.000',
        '1,00',
        '1,000',
    ];
    
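    foreach ($tests as $test) {
        // .* is greedy and \K drops it from the reported match,
        // so only the LAST dot or comma acts as the delimiter
        [$int, $sep, $dec] = preg_split('/.*\K([.,])/', $test, 3, PREG_SPLIT_DELIM_CAPTURE);
        var_dump($test, $int, $sep, $dec);
        echo "\n";
    }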
    

    Output:

    string(8) "1,000.00"
    string(5) "1,000"
    string(1) "."
    string(2) "00"
    
    string(9) "1,000.000"
    string(5) "1,000"
    string(1) "."
    string(3) "000"
    
    string(8) "1.000,00"
    string(5) "1.000"
    string(1) ","
    string(2) "00"
    
    string(9) "1.000,000"
    string(5) "1.000"
    string(1) ","
    string(3) "000"
    
    string(4) "1.00"
    string(1) "1"
    string(1) "."
    string(2) "00"
    
    string(5) "1.000"
    string(1) "1"
    string(1) "."
    string(3) "000"
    
    string(4) "1,00"
    string(1) "1"
    string(1) ","
    string(2) "00"
    
    string(5) "1,000"
    string(1) "1"
    string(1) ","
    string(3) "000"
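
    Note that the destructuring above assumes every input contains at least one dot or comma; a value such as 1000 would leave $sep and $dec undefined. Here is a minimal sketch of a guarded version. The helper name splitOnLastSeparator and the returned array keys are my own invention for illustration, not part of the code above.

    ```php
    <?php
    // Hypothetical helper: split a numeric string on its last dot or comma,
    // tolerating input that has no separator at all.
    function splitOnLastSeparator(string $number): array
    {
        // Same pattern as above: only the last [.,] is treated as the delimiter.
        $parts = preg_split('/.*\K([.,])/', $number, 3, PREG_SPLIT_DELIM_CAPTURE);

        // With no separator, preg_split() returns the input as a single element.
        return count($parts) === 3
            ? ['int' => $parts[0], 'sep' => $parts[1], 'dec' => $parts[2]]
            : ['int' => $number, 'sep' => null, 'dec' => ''];
    }

    var_export(splitOnLastSeparator('1.000,00'));
    var_export(splitOnLastSeparator('1000'));
    ```

    Returning null for the separator lets the caller decide how to treat plain integers instead of relying on PHP's undefined-offset warnings.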
    

    If your goal is to standardize the separators, you can translate only the non-standard ones. Any dot that is later followed by a comma is a thousands separator, and any comma that is followed only by digits up to the end of the string is a decimal separator; both need to be swapped.

    var_export(
        preg_replace_callback(
            '/\.(?=.*,)|,(?=\d*$)/',
            fn($m) => ['.' => ',', ',' => '.'][$m[0]],
            $tests
        )
    );
    

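    To remove the thousands separators and standardize the decimal separator to a dot, you can make two passes over the string: the first removes every separator that is followed by another separator, the second turns any remaining comma into a dot.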

    var_export(
        preg_replace(
            ['/[.,](?=\d*[.,])/', '/,/'],  // match thousands, match comma decimals
            ['',                  '.'],    // remove thousands, replace comma decimals
            $tests
        )
    );
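
    Since the stated goal is to insert the value into a database, the two-pass replacement can be wrapped in a small function that returns a canonical decimal string. This is only a sketch: the name normalizeNumber is hypothetical, and it keeps the assumption from the question that the last separator is always the decimal one, so a bare 1,000 is read as 1 and not as one thousand.

    ```php
    <?php
    // Hypothetical helper: strip thousands separators and use a dot as the
    // decimal separator, following the "last separator is decimal" rule.
    function normalizeNumber(string $raw): string
    {
        return preg_replace(
            ['/[.,](?=\d*[.,])/', '/,/'],  // separators followed by another separator, then leftover commas
            ['',                  '.'],    // drop the former, turn the latter into dots
            trim($raw)
        );
    }

    echo normalizeNumber('1.000,00'), "\n";      // 1000.00
    echo normalizeNumber('1,000,000.50'), "\n";  // 1000000.50
    echo normalizeNumber('1,000'), "\n";         // 1.000
    ```

    Keeping the result as a string rather than casting to float preserves the scraped decimal places, which may matter if the target column is a DECIMAL type.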
    

    If you only need to report which characters act as the thousands and decimal separators for a given string, check whether the last separator is a comma:

    $result = [];
    foreach ($tests as $test) {
        $result[] = preg_match('/,\d*$/', $test)
            ? ['input' => $test, 'thousands' => '.', 'decimal' => ',']
            : ['input' => $test, 'thousands' => ',', 'decimal' => '.'];
    }
    var_export($result);
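
    The loop above reports the dot as the decimal separator even when the input contains no separator at all. If that distinction matters, a variant along these lines could return null instead. The function name detectSeparators is hypothetical and the pattern is a small adaptation of the one above.

    ```php
    <?php
    // Hypothetical helper: report the separators of one numeric string,
    // or null when the string contains neither a dot nor a comma.
    function detectSeparators(string $number): ?array
    {
        // Find the last separator: a dot or comma followed only by digits.
        if (!preg_match('/[.,](?=\d*$)/', $number, $m)) {
            return null;
        }

        $decimal = $m[0];

        return [
            'thousands' => $decimal === ',' ? '.' : ',',
            'decimal'   => $decimal,
        ];
    }

    var_export(detectSeparators('1.000,00'));
    var_export(detectSeparators('1000'));
    ```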