Search code examples
pythonradix

How would I go about creating another int() function in Python so I can understand it?


I need to know how I would go about recreating a version of the int() function in Python so that I can fully understand it and create my own version with multiple bases that go past base 36. I can convert from a decimal to my own base (base 54, altered) just fine, but I need to figure out how to go from a string in my base's format to an integer (base 10).

Basically, I want to know how to go from my base, which I call base 54, to base 10. I don't need specifics, because if I have an example, I can work it out on my own. Unfortunately, I can't find anything on the int() function, though I know it has to be somewhere, since Python is open-source.

This is the closest I can find to it, but it doesn't provide source code for the function itself. Python int() test.

If you can help, thanks. If not, well, thanks for reading this, I guess?


Solution

  • You're not going to like this answer, but int(num, base) is defined in C (it's a builtin)

    I went searching around and found it:

    https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Objects/longobject.c

    /* Parses an int from a bytestring. Leading and trailing whitespace will be
     * ignored.
     *
     * If successful, a PyLong object will be returned and 'pend' will be pointing
     * to the first unused byte unless it's NULL.
     *
     * If unsuccessful, NULL will be returned.
     */
    PyObject *
    PyLong_FromString(const char *str, char **pend, int base)
    {
        int sign = 1, error_if_nonzero = 0;
        const char *start, *orig_str = str;
        PyLongObject *z = NULL;
        PyObject *strobj;
        Py_ssize_t slen;
    
        if ((base != 0 && base < 2) || base > 36) {
            PyErr_SetString(PyExc_ValueError,
                            "int() arg 2 must be >= 2 and <= 36");
            return NULL;
        }
        while (*str != '\0' && Py_ISSPACE(Py_CHARMASK(*str))) {
            str++;
        }
        if (*str == '+') {
            ++str;
        }
        else if (*str == '-') {
            ++str;
            sign = -1;
        }
        if (base == 0) {
            if (str[0] != '0') {
                base = 10;
            }
            else if (str[1] == 'x' || str[1] == 'X') {
                base = 16;
            }
            else if (str[1] == 'o' || str[1] == 'O') {
                base = 8;
            }
            else if (str[1] == 'b' || str[1] == 'B') {
                base = 2;
            }
            else {
                /* "old" (C-style) octal literal, now invalid.
                   it might still be zero though */
                error_if_nonzero = 1;
                base = 10;
            }
        }
        if (str[0] == '0' &&
            ((base == 16 && (str[1] == 'x' || str[1] == 'X')) ||
             (base == 8  && (str[1] == 'o' || str[1] == 'O')) ||
             (base == 2  && (str[1] == 'b' || str[1] == 'B')))) {
            str += 2;
            /* One underscore allowed here. */
            if (*str == '_') {
                ++str;
            }
        }
        if (str[0] == '_') {
            /* May not start with underscores. */
            goto onError;
        }
    
        start = str;
        if ((base & (base - 1)) == 0) {
            int res = long_from_binary_base(&str, base, &z);
            if (res < 0) {
                /* Syntax error. */
                goto onError;
            }
        }
        else {
    /***
    Binary bases can be converted in time linear in the number of digits, because
    Python's representation base is binary.  Other bases (including decimal!) use
    the simple quadratic-time algorithm below, complicated by some speed tricks.
    First some math:  the largest integer that can be expressed in N base-B digits
    is B**N-1.  Consequently, if we have an N-digit input in base B, the worst-
    case number of Python digits needed to hold it is the smallest integer n s.t.
        BASE**n-1 >= B**N-1  [or, adding 1 to both sides]
        BASE**n >= B**N      [taking logs to base BASE]
        n >= log(B**N)/log(BASE) = N * log(B)/log(BASE)
    The static array log_base_BASE[base] == log(base)/log(BASE) so we can compute
    this quickly.  A Python int with that much space is reserved near the start,
    and the result is computed into it.
    The input string is actually treated as being in base base**i (i.e., i digits
    are processed at a time), where two more static arrays hold:
        convwidth_base[base] = the largest integer i such that base**i <= BASE
        convmultmax_base[base] = base ** convwidth_base[base]
    The first of these is the largest i such that i consecutive input digits
    must fit in a single Python digit.  The second is effectively the input
    base we're really using.
    Viewing the input as a sequence <c0, c1, ..., c_n-1> of digits in base
    convmultmax_base[base], the result is "simply"
       (((c0*B + c1)*B + c2)*B + c3)*B + ... ))) + c_n-1
    where B = convmultmax_base[base].
    Error analysis:  as above, the number of Python digits `n` needed is worst-
    case
        n >= N * log(B)/log(BASE)
    where `N` is the number of input digits in base `B`.  This is computed via
        size_z = (Py_ssize_t)((scan - str) * log_base_BASE[base]) + 1;
    below.  Two numeric concerns are how much space this can waste, and whether
    the computed result can be too small.  To be concrete, assume BASE = 2**15,
    which is the default (and it's unlikely anyone changes that).
    Waste isn't a problem:  provided the first input digit isn't 0, the difference
    between the worst-case input with N digits and the smallest input with N
    digits is about a factor of B, but B is small compared to BASE so at most
    one allocated Python digit can remain unused on that count.  If
    N*log(B)/log(BASE) is mathematically an exact integer, then truncating that
    and adding 1 returns a result 1 larger than necessary.  However, that can't
    happen:  whenever B is a power of 2, long_from_binary_base() is called
    instead, and it's impossible for B**i to be an integer power of 2**15 when
    B is not a power of 2 (i.e., it's impossible for N*log(B)/log(BASE) to be
    an exact integer when B is not a power of 2, since B**i has a prime factor
    other than 2 in that case, but (2**15)**j's only prime factor is 2).
    The computed result can be too small if the true value of N*log(B)/log(BASE)
    is a little bit larger than an exact integer, but due to roundoff errors (in
    computing log(B), log(BASE), their quotient, and/or multiplying that by N)
    yields a numeric result a little less than that integer.  Unfortunately, "how
    close can a transcendental function get to an integer over some range?"
    questions are generally theoretically intractable.  Computer analysis via
    continued fractions is practical:  expand log(B)/log(BASE) via continued
    fractions, giving a sequence i/j of "the best" rational approximations.  Then
    j*log(B)/log(BASE) is approximately equal to (the integer) i.  This shows that
    we can get very close to being in trouble, but very rarely.  For example,
    76573 is a denominator in one of the continued-fraction approximations to
    log(10)/log(2**15), and indeed:
        >>> log(10)/log(2**15)*76573
        16958.000000654003
    is very close to an integer.  If we were working with IEEE single-precision,
    rounding errors could kill us.  Finding worst cases in IEEE double-precision
    requires better-than-double-precision log() functions, and Tim didn't bother.
    Instead the code checks to see whether the allocated space is enough as each
    new Python digit is added, and copies the whole thing to a larger int if not.
    This should happen extremely rarely, and in fact I don't have a test case
    that triggers it(!).  Instead the code was tested by artificially allocating
    just 1 digit at the start, so that the copying code was exercised for every
    digit beyond the first.
    ***/
            twodigits c;           /* current input character */
            Py_ssize_t size_z;
            Py_ssize_t digits = 0;
            int i;
            int convwidth;
            twodigits convmultmax, convmult;
            digit *pz, *pzstop;
            const char *scan, *lastdigit;
            char prev = 0;
    
            static double log_base_BASE[37] = {0.0e0,};
            static int convwidth_base[37] = {0,};
            static twodigits convmultmax_base[37] = {0,};
    
            if (log_base_BASE[base] == 0.0) {
                twodigits convmax = base;
                int i = 1;
    
                log_base_BASE[base] = (log((double)base) /
                                       log((double)PyLong_BASE));
                for (;;) {
                    twodigits next = convmax * base;
                    if (next > PyLong_BASE) {
                        break;
                    }
                    convmax = next;
                    ++i;
                }
                convmultmax_base[base] = convmax;
                assert(i > 0);
                convwidth_base[base] = i;
            }
    
            /* Find length of the string of numeric characters. */
            scan = str;
            lastdigit = str;
    
            while (_PyLong_DigitValue[Py_CHARMASK(*scan)] < base || *scan == '_') {
                if (*scan == '_') {
                    if (prev == '_') {
                        /* Only one underscore allowed. */
                        str = lastdigit + 1;
                        goto onError;
                    }
                }
                else {
                    ++digits;
                    lastdigit = scan;
                }
                prev = *scan;
                ++scan;
            }
            if (prev == '_') {
                /* Trailing underscore not allowed. */
                /* Set error pointer to first underscore. */
                str = lastdigit + 1;
                goto onError;
            }
    
            /* Create an int object that can contain the largest possible
             * integer with this base and length.  Note that there's no
             * need to initialize z->ob_digit -- no slot is read up before
             * being stored into.
             */
            double fsize_z = (double)digits * log_base_BASE[base] + 1.0;
            if (fsize_z > (double)MAX_LONG_DIGITS) {
                /* The same exception as in _PyLong_New(). */
                PyErr_SetString(PyExc_OverflowError,
                                "too many digits in integer");
                return NULL;
            }
            size_z = (Py_ssize_t)fsize_z;
            /* Uncomment next line to test exceedingly rare copy code */
            /* size_z = 1; */
            assert(size_z > 0);
            z = _PyLong_New(size_z);
            if (z == NULL) {
                return NULL;
            }
            Py_SIZE(z) = 0;
    
            /* `convwidth` consecutive input digits are treated as a single
             * digit in base `convmultmax`.
             */
            convwidth = convwidth_base[base];
            convmultmax = convmultmax_base[base];
    
            /* Work ;-) */
            while (str < scan) {
                if (*str == '_') {
                    str++;
                    continue;
                }
                /* grab up to convwidth digits from the input string */
                c = (digit)_PyLong_DigitValue[Py_CHARMASK(*str++)];
                for (i = 1; i < convwidth && str != scan; ++str) {
                    if (*str == '_') {
                        continue;
                    }
                    i++;
                    c = (twodigits)(c *  base +
                                    (int)_PyLong_DigitValue[Py_CHARMASK(*str)]);
                    assert(c < PyLong_BASE);
                }
    
                convmult = convmultmax;
                /* Calculate the shift only if we couldn't get
                 * convwidth digits.
                 */
                if (i != convwidth) {
                    convmult = base;
                    for ( ; i > 1; --i) {
                        convmult *= base;
                    }
                }
    
                /* Multiply z by convmult, and add c. */
                pz = z->ob_digit;
                pzstop = pz + Py_SIZE(z);
                for (; pz < pzstop; ++pz) {
                    c += (twodigits)*pz * convmult;
                    *pz = (digit)(c & PyLong_MASK);
                    c >>= PyLong_SHIFT;
                }
                /* carry off the current end? */
                if (c) {
                    assert(c < PyLong_BASE);
                    if (Py_SIZE(z) < size_z) {
                        *pz = (digit)c;
                        ++Py_SIZE(z);
                    }
                    else {
                        PyLongObject *tmp;
                        /* Extremely rare.  Get more space. */
                        assert(Py_SIZE(z) == size_z);
                        tmp = _PyLong_New(size_z + 1);
                        if (tmp == NULL) {
                            Py_DECREF(z);
                            return NULL;
                        }
                        memcpy(tmp->ob_digit,
                               z->ob_digit,
                               sizeof(digit) * size_z);
                        Py_DECREF(z);
                        z = tmp;
                        z->ob_digit[size_z] = (digit)c;
                        ++size_z;
                    }
                }
            }
        }
        if (z == NULL) {
            return NULL;
        }
        if (error_if_nonzero) {
            /* reset the base to 0, else the exception message
               doesn't make too much sense */
            base = 0;
            if (Py_SIZE(z) != 0) {
                goto onError;
            }
            /* there might still be other problems, therefore base
               remains zero here for the same reason */
        }
        if (str == start) {
            goto onError;
        }
        if (sign < 0) {
            Py_SIZE(z) = -(Py_SIZE(z));
        }
        while (*str && Py_ISSPACE(Py_CHARMASK(*str))) {
            str++;
        }
        if (*str != '\0') {
            goto onError;
        }
        long_normalize(z);
        z = maybe_small_long(z);
        if (z == NULL) {
            return NULL;
        }
        if (pend != NULL) {
            *pend = (char *)str;
        }
        return (PyObject *) z;
    
      onError:
        if (pend != NULL) {
            *pend = (char *)str;
        }
        Py_XDECREF(z);
        slen = strlen(orig_str) < 200 ? strlen(orig_str) : 200;
        strobj = PyUnicode_FromStringAndSize(orig_str, slen);
        if (strobj == NULL) {
            return NULL;
        }
        PyErr_Format(PyExc_ValueError,
                     "invalid literal for int() with base %d: %.200R",
                     base, strobj);
        Py_DECREF(strobj);
        return NULL;
    }
    

    If you want to defined it in C, go ahead and try using this- if not, you'll have to write it yourself