I am working on a Collatz Conjecture program. I have a sequence following the simple rule:
If the current term is even: the next term is half the current term.
If the current term is odd: the next term is three times the current term, plus 1.
We continue this until we reach one and count each term used.
For example, for the number 17, we have:
17 52 26 13 40 20 10 5 16 8 4 2 1
Hence, the count is 13.
What I specifically need assistance with is finding the smallest starting value whose count exceeds 1234.
i.e.:
1 has count 1
2 has count 2
3 has count 8
4 has count 3
5 has count 6
6 has count 9
......
......
97 has count 119
etc
This is the code I have, which I think works but takes extremely long to run. How can I optimize it so it finds the count quicker? I was recommended a binary search, but the count isn't monotonic, so that doesn't work.
#include <stdio.h>

int main(void) {
    unsigned long long int num1 = 1;
    while (num1 <= 99999999999999999ULL) {
        unsigned long long int num = num1;
        int count = 1;
        while (num != 1) {
            if (num % 2 == 0) {
                num = num / 2;
                count++;
            } else {
                num = (num * 3) + 1;
                count++;
            }
        }
        if (count > 1234) {
            printf("%llu\n", num1);
            break;
        }
        num1++;
    }
    return 0;
}
The smallest root that yields a Collatz sequence longer than 1234 steps is 133561134663. Its sequence peaks at 319497287463520, which is a 49-bit value, so this particular sequence can be evaluated using the uint64_t provided in <stdint.h> (usually included via <inttypes.h>) by many C libraries.
Unfortunately, there are many sequences with lower starting values that require up to 71-bit integers to resolve correctly, like the 573-step sequence that starts at 110243094271.
Most annoying is that if you implement Collatz sequences using 64-bit unsigned integers without checking for overflow, you'll calculate the wrong lengths for those sequences; but because they happen to have uninteresting lengths, the bug is masked, and you can still find the aforementioned solution!
Essentially, if this were a C exercise, it would be buggy: even a horribly buggy and limited implementation can find the correct answer.
I happen to use GCC 5.4.0 on 64-bit Linux (running on a Core i5-7200U CPU in a laptop), so I can use unsigned __int128, a 128-bit unsigned integer type, for the Collatz sequence. I also have 16 GiB of RAM, of which I used about 12 GiB to store the lengths of sequences up to 13 billion (13×10⁹), for odd starting values only.
Whenever you find the length of the sequence starting at some odd i, the length of the sequence starting at 2i is one step longer (but otherwise the same). In fact, there is a sequence starting at 2ki that is exactly k steps longer. If you are only looking for the smallest starting value for a sequence of some minimum length, you can ignore all even starting points until k is sufficiently small (essentially bracketing the range of starting values you need to look into).
Considering only odd starting values would give a significant speedup (20% or more), but I drew my "line" here, and instead check every single starting value.
To speed up the calculation, I used the __builtin_ctz() GCC built-in (__builtin_ctzll() for 64-bit unsigned long long) to find the number of consecutive least significant zero bits. That way I can handle all consecutive "even" steps in one go, while still counting the number of steps correctly.
I also split up the sequence stepper into two parallel ones, one handling the cases where the state fits in a 64-bit variable, and the other where full 128 bits are needed.
In the 64-bit only part, if the sequence length for the current state is cacheable, I look it up. If it is nonzero, I add it to the number of steps already taken, and I have the result. Otherwise, I add that cache entry index, and the number of steps from the root, to a small update cache; that way, when I do have the result, I do not need to retrace the sequence, and can simply apply the updates in a tight loop. Before multiplying by three and adding one, I check that the value does not overflow (6148914691236517205 is the smallest value for which 3n+1 overflows, reaching 2⁶⁴); if it would, I switch to the 128-bit part, and do the multiply-add there instead.
In the 128-bit part, I assume that we don't have that much memory (the cache would have to be in the exabyte range), so I don't need to worry about caching at all. Before I multiply by three and add one (which I do using curr128 += curr128 + curr128 + 1), I do check that the state does not overflow, just to be sure.
On a laptop with an Intel Core i5-7200U, using a single core, it took about 4253 seconds of CPU time (about an hour and ten minutes) to calculate the lengths of all Collatz sequences starting from 1 up to 133561134663.
The next, even longer sequence starts at 158294678119 and is 1243 steps long; to get that far, my laptop burned an hour and a half of CPU time.
However, to get to that point, the code has become nothing like OP's, and is quite horrible. In particular, to intermix the 64-bit/128-bit looping, I had to resort to goto; and the entire implementation is littered with GCC built-ins. I consider it very nearly write-only code; that is, if I suspected a bug in any part of it, I'd rewrite it from scratch. This is the typical end result when optimization is prioritized over maintainability and robustness.
Anyway, to let others using GCC on Linux on x86-64 verify the above, here is the horrible code, collatz.c:
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <inttypes.h>
#include <string.h>
#include <stdarg.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>
#define MAX_SEQ_LEN 2000
#define OVER64 UINT64_C(6148914691236517205)
typedef unsigned __int128 u128;
typedef uint64_t u64;
static inline u64 u128_lo(const u128 u) { return (u64)u; }
static inline u64 u128_hi(const u128 u) { return (u64)(u >> 64); }
static inline u64 highest_bit_set_64(const u64 u)
{
if (sizeof u == sizeof (unsigned int))
return 63 - __builtin_clz(u);
else
if (sizeof u == sizeof (unsigned long))
return 63 - __builtin_clzl(u);
else
if (sizeof u == sizeof (unsigned long long))
return 63 - __builtin_clzll(u);
else
exit(EXIT_FAILURE);
}
static unsigned int highest_bit_set_128(u128 u)
{
u64 hi = u128_hi(u);
if (hi)
return 64 + highest_bit_set_64(hi);
else
return highest_bit_set_64(u128_lo(u));
}
static inline unsigned int ctz64(const u64 u)
{
if (sizeof u <= sizeof (unsigned int))
return __builtin_ctz(u);
else
if (sizeof u <= sizeof (unsigned long))
return __builtin_ctzl(u);
else
if (sizeof u <= sizeof (unsigned long long))
return __builtin_ctzll(u);
else
exit(EXIT_FAILURE);
}
static inline unsigned int ctz128(const u128 u)
{
if (sizeof u == sizeof (unsigned long long))
return __builtin_ctzll(u);
else {
const u64 lo = u128_lo(u);
if (lo)
return ctz64(lo);
else
return 64 + ctz64(u128_hi(u));
}
}
static const char *s128(u128 u)
{
static char buffer[40];
char *p = buffer + sizeof buffer;
*(--p) = '\0';
do {
*(--p) = '0' + (u % 10);
u /= 10;
} while (u);
return p;
}
static struct timespec wall_started, wall_now;
static inline void wall_start(void)
{
clock_gettime(CLOCK_MONOTONIC, &wall_started);
}
static inline double wall_seconds(void)
{
clock_gettime(CLOCK_MONOTONIC, &wall_now);
return (double)(wall_now.tv_sec - wall_started.tv_sec)
+ (double)(wall_now.tv_nsec - wall_started.tv_nsec) / 1000000000.0;
}
static struct timespec cpu_elapsed;
static inline double cpu_seconds(void)
{
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_elapsed);
return (double)cpu_elapsed.tv_sec + (double)cpu_elapsed.tv_nsec / 1000000000.0;
}
static int out_and_err = 0;
static void print(const char *fmt, ...)
{
va_list args;
va_start(args, fmt);
vdprintf(STDOUT_FILENO, fmt, args);
va_end(args);
if (out_and_err) {
va_start(args, fmt);
vdprintf(STDERR_FILENO, fmt, args);
va_end(args);
}
}
static size_t cache_index[MAX_SEQ_LEN];
static unsigned int cache_depth[MAX_SEQ_LEN];
static u64 seq_max = 0; /* 2*seq_num */
static size_t seq_num = 0;
static uint16_t *seq_len = NULL;
static unsigned int tip_bits = 0;
static u128 tip_max = 0;
static inline unsigned int collatz_length(const u64 root, u128 *maxto)
{
u128 curr128, max128;
u64 curr64, max64, lo;
size_t cached = 0, i;
unsigned int steps = 1, n;
curr128 = max128 = root;
curr64 = max64 = root;
any64bit:
if (!(curr64 & 1)) {
n = ctz64(curr64);
curr64 >>= n;
steps += n;
}
if (curr64 >= OVER64) {
curr128 = curr64;
goto odd128bit;
}
odd64bit:
if (curr64 <= 1)
goto done;
if (curr64 < seq_max) {
i = curr64 >> 1;
if (seq_len[i]) {
steps += seq_len[i];
goto done;
}
cache_index[cached] = i;
cache_depth[cached] = steps;
cached++;
}
curr64 += curr64 + curr64 + 1;
steps++;
if (max64 < curr64)
max64 = curr64;
goto any64bit;
any128bit:
if (!(curr128 & 1)) {
n = ctz128(curr128);
curr128 >>= n;
steps += n;
if (!u128_hi(curr128)) {
lo = u128_lo(curr128);
if (lo <= 1)
goto done;
if (lo < OVER64) {
curr64 = lo;
goto odd64bit;
}
}
}
odd128bit:
if (u128_hi(curr128) >= OVER64) {
print("Overflow at root %" PRIu64 ".\n", root);
exit(EXIT_FAILURE);
}
curr128 += curr128 + curr128 + 1;
steps++;
if (max128 < curr128)
max128 = curr128;
goto any128bit;
done:
if (cached >= MAX_SEQ_LEN) {
print("Update cache overrun.\n");
exit(EXIT_FAILURE);
}
while (cached-->0)
seq_len[ cache_index[cached] ] = steps - cache_depth[cached];
if (max128 < (u128)max64)
max128 = max64;
if (maxto)
*maxto = max128;
if (tip_max <= max128) {
const unsigned int maxbits = highest_bit_set_128(max128) + 1;
tip_max = max128;
if (tip_bits <= maxbits) {
tip_bits = maxbits;
print("%" PRIu64 " length %u (reaches %s - %u bits).\n", root, steps, s128(max128), maxbits);
}
}
return steps;
}
int main(void)
{
unsigned int n, nmax = 0;
u128 m;
uint64_t i = 1;
wall_start();
/* If standard output is redirected to a file, print everything to standard error also. */
out_and_err = (isatty(STDERR_FILENO) && !isatty(STDOUT_FILENO));
/* Try allocating up to 16 GiB of cache. */
seq_num = (size_t)1024 * 1024 * 1024 * 16 / sizeof seq_len[0];
while (1) {
seq_len = malloc(seq_num * sizeof seq_len[0]);
if (seq_len)
break;
seq_num = ( seq_num * 7 ) / 8;
}
seq_max = 2 * (uint64_t)seq_num;
/* Touch the whole cache to force the kernel to back it with physical RAM, then zero it. */
memset(seq_len, ~0, seq_num * sizeof seq_len[0]);
memset(seq_len, 0, seq_num * sizeof seq_len[0]);
print("Allocated %zu entries (%.3f GiB)\n", seq_num, (double)(seq_num * sizeof seq_len[0]) / 1073741824.0);
do {
n = collatz_length(i, &m);
if (n >= nmax) {
const double cs = cpu_seconds();
const double ws = wall_seconds();
const char *s = s128(m);
nmax = n;
print("%" PRIu64 " length %u (reaches %s) [%.3f seconds elapsed, %.3f seconds CPU time]\n", i, n, s, ws, cs);
}
i++;
} while (nmax < MAX_SEQ_LEN);
return EXIT_SUCCESS;
}
I compile and run it using gcc -Wall -O2 collatz.c -lrt -o collatz && ./collatz > results.txt. (Optimization ensures, for example, that the sizeof if clauses are resolved at compile time, and generates a minimal number of slow conditional jumps, using conditional moves instead. So, the above program is designed to be compiled with -O2 at least.) It does contain extra code, for example to display the real time and CPU time used. While not related to the problem at hand, these details do help TAs easily detect if someone were to try to submit this code as their own homework.
While it is running, I can use sort -gk 3 results.txt in another terminal to see the results obtained thus far, sorted by increasing sequence length; or sort -gk 5 results.txt to see them sorted by the peak of the sequence (the largest value reached).