python, nlp

Convert numbers to English strings


Websites like http://www.easysurf.cc/cnvert18.htm and http://www.calculatorsoup.com/calculators/conversions/numberstowords.php try to convert a numeric string into English words, but they don't give natural-sounding output.

For example, on http://www.easysurf.cc/cnvert18.htm:

[in]: 100456
[out]:  one hundred  thousand four hundred fifty-six

This website is a little better (http://www.calculator.org/calculate-online/mathematics/text-number.aspx):

[in]: 100456
[out]: one hundred thousand, four hundred and fifty-six

[in]: 10123124001
[out]: ten billion, one hundred and twenty-three million, one hundred and twenty-four thousand, one 

but it breaks at some point:

[in]: 10000000001
[out]: ten billion, , , one 

I've written my own version, but it involves lots of rules and it caps at one billion (from http://pastebin.com/WwFCjYtt):

import codecs

def num2word (num):
  ones = {1:"one",2:"two",3:"three",4:"four",
          5:"five",6:"six",7:"seven",8:"eight",
          9:"nine",0:"zero",10:"ten"}
  teens = {11:"eleven",12:"twelve",13:"thirteen",
           14:"fourteen",15:"fifteen"}
  tens = {2:"twenty",3:"thirty",4:"forty",
          5:"fifty",6:"sixty",7:"seventy",
          8:"eighty",9:"ninety"}
  lens = {3:"hundred",4:"thousand",6:"hundred",7:"million",
          8:"million", 9:"million",10:"billion"#,13:"trillion",11:"googol",
          }

  if num > 999999999:
    return "Number more than 1 billion"

  # Ones
  if num < 11:
    return ones[num]
  # Teens
  if num < 20:
    word = ones[num%10] + "teen" if num > 15 else teens[num]
    return word
  # Tens
  if num > 19 and num < 100:
    word = tens[int(str(num)[0])]
    if str(num)[1] == "0":
      return word
    else:
      word = word + " " + ones[num%10]
      return word

  # First digit for thousands,hundred-thousands.
  if len(str(num)) in lens and len(str(num)) != 3:
    word = ones[int(str(num)[0])] + " " + lens[len(str(num))]
  else:
    word = ""

  # Hundred to Million  
  if num < 1000000:
    # First and Second digit for ten thousands.  
    if len(str(num)) == 5:
      word = num2word(int(str(num)[0:2])) + " thousand"
    # How many hundred-thousand(s).
    if len(str(num)) == 6:
      word = word + " " + num2word(int(str(num)[1:3])) + \
            " " + lens[len(str(num))-2]
    # How many hundred(s)?
    thousand_pt = len(str(num)) - 3
    word = word + " " + ones[int(str(num)[thousand_pt])] + \
            " " + lens[len(str(num))-thousand_pt]
    # Last 2 digits.
    last2 = num2word(int(str(num)[-2:]))
    if last2 != "zero":
      word = word + " and " + last2
    word = word.replace(" zero hundred","")
    return word.strip()

  left, right = '',''  
  # From 1 million up to (but not including) 100 million.
  if num < 100000000:
    left = num2word(int(str(num)[:-6])) + " " + lens[len(str(num))]
    right = num2word(int(str(num)[-6:]))
  # From 100 million up to (but not including) 1 billion.
  if num >= 100000000 and num < 1000000000:
    left = num2word(int(str(num)[:3])) +  " " + lens[len(str(num))]
    right = num2word(int(str(num)[-6:]))
  if int(str(num)[-6:]) < 100:
    word = left + " and " + right
  else:  
    word = left + " " + right
  word = word.replace(" zero hundred","").replace(" zero thousand"," thousand")
  return word

print num2word(int(raw_input("Give me a number:\n")))

How can I make the script I've written accept numbers larger than one billion?

Is there any other way to get the same output?

Can my code be written in a less verbose way?


Solution

  • A more general approach to this problem uses repeated division (i.e. divmod) and only hardcodes the special/edge cases necessary.

    For example, divmod(1034393, 1000000) -> (1, 34393), so you've effectively found the number of millions and are left with a remainder for further calculations.

    A possibly more illustrative example: divmod(1034393, 1000) -> (1034, 393), which lets you take off groups of 3 decimal digits at a time from the right.
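
As a small sketch of that idea, repeatedly applying divmod(n, 1000) peels three-digit groups off the right-hand side of the number. The name split_groups is hypothetical and chosen only for illustration, and the sketch assumes a non-negative integer.

```python
def split_groups(num):
    """Split a non-negative integer into three-digit groups, least significant first."""
    groups = []
    while True:
        # divmod returns (quotient, remainder): the remainder is the lowest group.
        num, group = divmod(num, 1000)
        groups.append(group)
        if num == 0:
            break
    return groups


print(split_groups(1034393))      # [393, 34, 1]
print(split_groups(10000000001))  # [1, 0, 0, 10]
```

The second call shows why the empty groups matter: the zeros in the middle are exactly the groups that should produce no words at all.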

    In English we tend to group digits in threes, and the same rules apply within each group. This should be parameterized rather than hard-coded. For example, "303" could be three hundred and three million, three hundred and three thousand, or three hundred and three. The logic is the same in each case; only the suffix changes, depending on which place you're in. Edit: it looks like your code already does some of this through recursion.
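
One way to picture the parameterization is to keep the scale words in a list indexed by group position, so the per-group logic never needs to know where it sits. SCALES and label_groups are hypothetical names, not part of the original code, and the list stops at "trillion" purely for brevity.

```python
# Assumed scale table: index 0 is the units group, index 1 the thousands, and so on.
SCALES = ['', 'thousand', 'million', 'billion', 'trillion']


def label_groups(num):
    """Pair each non-zero three-digit group with its scale word, most significant first.

    Raises IndexError for numbers too large for the SCALES table above.
    """
    labelled = []
    index = 0
    while num:
        num, group = divmod(num, 1000)
        if group:  # all-zero groups contribute no words, so skip them
            labelled.append((group, SCALES[index]))
        index += 1
    return labelled[::-1]


print(label_groups(303303303))
# [(303, 'million'), (303, 'thousand'), (303, '')]
print(label_groups(10000000001))
# [(10, 'billion'), (1, '')]
```

Supporting larger numbers is then only a matter of appending more entries to the table.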

    Here is a partial example of the kind of approach I mean, using a generator and operating on integers rather than doing lots of int(str(i)[..]) everywhere.

    say_base = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
        'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
        'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
    
    say_tens = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy',
        'eighty', 'ninety']
    
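    def hundreds_i(num):
        """Yield the word fragments for an integer from 1 to 999."""
        hundreds, rest = divmod(num, 100)
        if hundreds:
            yield say_base[hundreds]
            yield ' hundred'
        if rest:
            # Only join with "and" when a hundreds part precedes the remainder,
            # so that 45 gives "forty-five" rather than " and forty-five".
            if hundreds:
                yield ' and '
            if rest < len(say_base):
                # 1 to 19 have their own words.
                yield say_base[rest]
            else:
                tens, ones = divmod(rest, 10)
                yield say_tens[tens]
                if ones > 0:
                    yield '-'
                    yield say_base[ones]
    
    assert "".join(hundreds_i(245)) == "two hundred and forty-five"
    assert "".join(hundreds_i(999)) == 'nine hundred and ninety-nine'
    assert "".join(hundreds_i(200)) == 'two hundred'
    assert "".join(hundreds_i(45)) == 'forty-five'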
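
To show how the pieces could fit together for numbers beyond a billion, here is a minimal self-contained sketch. It is one possible arrangement rather than a definitive implementation. The names SAY_BASE, SAY_TENS, SCALES, say_below_1000 and num2words are hypothetical, and the punctuation rule (commas between groups, "and" before a final group below one hundred) is an assumption modelled on the output the question describes as better.

```python
SAY_BASE = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
            'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
            'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
SAY_TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy',
            'eighty', 'ninety']
# Assumed scale table; extend it to support larger numbers.
SCALES = ['', 'thousand', 'million', 'billion', 'trillion', 'quadrillion']


def say_below_1000(num):
    """Return the words for 1 <= num <= 999."""
    hundreds, rest = divmod(num, 100)
    parts = []
    if hundreds:
        parts.append(SAY_BASE[hundreds] + ' hundred')
    if rest:
        if rest < len(SAY_BASE):
            parts.append(SAY_BASE[rest])
        else:
            tens, ones = divmod(rest, 10)
            parts.append(SAY_TENS[tens] + ('-' + SAY_BASE[ones] if ones else ''))
    return ' and '.join(parts)


def num2words(num):
    """Convert a non-negative integer to English words, one 3-digit group at a time."""
    if num < 0:
        raise ValueError('negative numbers are not handled in this sketch')
    if num == 0:
        return 'zero'
    last_group = num % 1000
    chunks = []
    index = 0
    while num:
        num, group = divmod(num, 1000)
        if group:  # skip all-zero groups entirely
            if index >= len(SCALES):
                raise ValueError('number too large for the SCALES table')
            words = say_below_1000(group)
            if SCALES[index]:
                words += ' ' + SCALES[index]
            chunks.append(words)
        index += 1
    chunks.reverse()
    # Assumed convention: "and" joins a final group that has no hundreds part.
    if len(chunks) > 1 and 0 < last_group < 100:
        return ', '.join(chunks[:-1]) + ' and ' + chunks[-1]
    return ', '.join(chunks)


print(num2words(100456))
# one hundred thousand, four hundred and fifty-six
print(num2words(10000000001))
# ten billion and one
```

Because all-zero groups are skipped before any words are produced, the "ten billion, , , one" failure from the question cannot occur, and the upper limit depends only on the length of the scale table.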