I am working on a Collatz Conjecture program. I have a sequence following the simple rule:
If the current term is even: the next term is half the current term.
If the current term is odd: the next term is three times the current term, plus 1.
We continue this until we reach one and count each term used.
For example, for the number 17, we have:
17 52 26 13 40 20 10 5 16 8 4 2 1
Hence, the count is 13.
What I specifically need assistance with is finding the smallest starting value whose count exceeds 1234.
i.e.:
1 has count 1
2 has count 2
3 has count 8
4 has count 3
5 has count 6
6 has count 9
......
......
97 has count 119
etc
This is the code I have, which I think works but takes extremely long to run. How can I optimize it so it finds the count quicker? I was recommended a binary search, but the count isn't monotonic, so that doesn't work.
#include <stdio.h>

int main(void) {
    unsigned long long int num1 = 1;
    while (num1 <= 99999999999999999ULL) {
        unsigned long long int num = num1;
        int count = 1;
        while (num != 1) {
            if (num % 2 == 0) {
                num = num / 2;
                count++;
            } else {
                num = (num * 3) + 1;
                count++;
            }
        }
        if (count > 1234) {
            printf("%llu\n", num1);
            break;
        }
        num1++;
    }
    return 0;
}
The smallest root that yields a Collatz sequence longer than 1234 steps is 133561134663. Its sequence peaks at 319497287463520, which is a 49-bit value, so this particular sequence can be evaluated using the uint64_t provided in <stdint.h> (usually included via <inttypes.h>) by many C libraries.
Unfortunately, there are many sequences with lower starting values that require up to 71-bit integers to resolve correctly, like the 573-step sequence that starts at 110243094271.
Most annoying is that if you implement Collatz sequences using 64-bit unsigned integers without checking for overflow, you'll calculate the wrong lengths for those sequences; but because they happen to have uninteresting lengths, the bug is masked, and you can still find the aforementioned solution!
Essentially, if this were a C exercise, it would be buggy: even a horribly buggy and limited implementation can find the correct answer.
I happen to use GCC 5.4.0 on 64-bit Linux (running on a Core i5-7200U CPU in a laptop), so I can use unsigned __int128, a 128-bit unsigned integer type, for the Collatz sequence. I also have 16 GiB of RAM, of which I used about 12 GiB to store the lengths of sequences up to 13 billion (13×10⁹), for odd starting values only.
Whenever you find the length of the sequence starting at some odd i, the length of the sequence starting at 2i is one step longer (but otherwise the same). In fact, there is a sequence starting at 2ki that is exactly k steps longer. If you are only looking for the smallest starting value for a sequence of some minimum length, you can ignore all even starting points until k is sufficiently small (essentially bracketing the range of starting values you need to look into).
Considering only odd starting values would give a significant speedup (20% or more), but I drew my "line" here, and instead check every single starting value.
To speed up the calculation, I used the __builtin_ctz() GCC built-in (__builtin_ctzll() for 64-bit unsigned long long) to find the number of consecutive least significant zero bits. That way I can handle all consecutive "even" steps in one go, while still counting the number of steps correctly.
I also split up the sequence stepper into two parallel ones, one handling the cases where the state fits in a 64-bit variable, and the other where full 128 bits are needed.
In the 64-bit only part, if the sequence length for the current state is cacheable, I look it up. If it is nonzero, I add it to the number of steps already taken, and I have the result. Otherwise, I add that cache entry index, and the number of steps from the root, to a small update cache; that way, when I do have the result, I do not need to retrace the sequence, and can simply apply the updates in a tight loop. Before multiplying by three and adding one, I check that the value does not overflow (6148914691236517205 is the smallest value for which 3n+1 overflows, reaching 2⁶⁴); if it would, I switch to the 128-bit part, and do the multiply-add there instead.
In the 128-bit part, I assume that we don't have that much memory (the cache would have to be in the exabyte range), so I don't need to worry about caching at all. Before I multiply by three and add one (which I do using curr128 += curr128 + curr128 + 1), I do check that the state does not overflow, just to be sure.
On a laptop with an Intel Core i5-7200U, using a single core, it took about 4253 seconds of CPU time (about an hour and ten minutes) to calculate the lengths of all Collatz sequences starting from 1 up to 133561134663.
The next, even longer sequence starts at 158294678119 and is 1243 steps long; to get that far, my laptop burned an hour and a half of CPU time.
However, to get to that point, the code has become nothing like OP's, and is quite horrible. In particular, to intermix the 64-bit/128-bit looping, I had to resort to goto; and the entire implementation is littered with GCC built-ins. I consider it very nearly write-only code; that is, if I suspected a bug in any part of it, I'd rewrite it from scratch. This is the typical end result when optimization is prioritized over maintainability and robustness.
Anyway, to let others using GCC on Linux on x86-64 verify the above, here is the horrible code, collatz.c:
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <inttypes.h>
#include <string.h>
#include <stdarg.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>
#define MAX_SEQ_LEN 2000
#define OVER64 UINT64_C(6148914691236517205)
typedef unsigned __int128 u128;
typedef uint64_t u64;
static inline u64 u128_lo(const u128 u) { return (u64)u; }
static inline u64 u128_hi(const u128 u) { return (u64)(u >> 64); }
static inline u64 highest_bit_set_64(const u64 u)
{
if (sizeof u == sizeof (unsigned int))
return 63 - __builtin_clz(u);
else
if (sizeof u == sizeof (unsigned long))
return 63 - __builtin_clzl(u);
else
if (sizeof u == sizeof (unsigned long long))
return 63 - __builtin_clzll(u);
else
exit(EXIT_FAILURE);
}
static unsigned int highest_bit_set_128(u128 u)
{
u64 hi = u128_hi(u);
if (hi)
return 64 + highest_bit_set_64(hi);
else
return highest_bit_set_64(u128_lo(u));
}
static inline unsigned int ctz64(const u64 u)
{
if (sizeof u <= sizeof (unsigned int))
return __builtin_ctz(u);
else
if (sizeof u <= sizeof (unsigned long))
return __builtin_ctzl(u);
else
if (sizeof u <= sizeof (unsigned long long))
return __builtin_ctzll(u);
else
exit(EXIT_FAILURE);
}
static inline unsigned int ctz128(const u128 u)
{
if (sizeof u == sizeof (unsigned long long))
return __builtin_ctzll(u);
else {
const u64 lo = u128_lo(u);
if (lo)
return ctz64(lo);
else
return 64 + ctz64(u128_hi(u));
}
}
static const char *s128(u128 u)
{
static char buffer[40];
char *p = buffer + sizeof buffer;
*(--p) = '\0';
do {
*(--p) = '0' + (u % 10);
u /= 10;
} while (u);
return p;
}
static struct timespec wall_started, wall_now;
static inline void wall_start(void)
{
clock_gettime(CLOCK_MONOTONIC, &wall_started);
}
static inline double wall_seconds(void)
{
clock_gettime(CLOCK_MONOTONIC, &wall_now);
return (double)(wall_now.tv_sec - wall_started.tv_sec)
+ (double)(wall_now.tv_nsec - wall_started.tv_nsec) / 1000000000.0;
}
static struct timespec cpu_elapsed;
static inline double cpu_seconds(void)
{
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_elapsed);
return (double)cpu_elapsed.tv_sec + (double)cpu_elapsed.tv_nsec / 1000000000.0;
}
static int out_and_err = 0;
static void print(const char *fmt, ...)
{
va_list args;
va_start(args, fmt);
vdprintf(STDOUT_FILENO, fmt, args);
va_end(args);
if (out_and_err) {
va_start(args, fmt);
vdprintf(STDERR_FILENO, fmt, args);
va_end(args);
}
}
static size_t cache_index[MAX_SEQ_LEN];
static unsigned int cache_depth[MAX_SEQ_LEN];
static u64 seq_max = 0; /* 2*seq_num */
static size_t seq_num = 0;
static uint16_t *seq_len = NULL;
static unsigned int tip_bits = 0;
static u128 tip_max = 0;
static inline unsigned int collatz_length(const u64 root, u128 *maxto)
{
u128 curr128, max128;
u64 curr64, max64, lo;
size_t cached = 0, i;
unsigned int steps = 1, n;
curr128 = max128 = root;
curr64 = max64 = root;
any64bit:
if (!(curr64 & 1)) {
n = ctz64(curr64);
curr64 >>= n;
steps += n;
}
if (curr64 >= OVER64) {
curr128 = curr64;
goto odd128bit;
}
odd64bit:
if (curr64 <= 1)
goto done;
if (curr64 < seq_max) {
i = curr64 >> 1;
if (seq_len[i]) {
steps += seq_len[i];
goto done;
}
cache_index[cached] = i;
cache_depth[cached] = steps;
cached++;
}
curr64 += curr64 + curr64 + 1;
steps++;
if (max64 < curr64)
max64 = curr64;
goto any64bit;
any128bit:
if (!(curr128 & 1)) {
n = ctz128(curr128);
curr128 >>= n;
steps += n;
if (!u128_hi(curr128)) {
lo = u128_lo(curr128);
if (lo <= 1)
goto done;
if (lo < OVER64) {
curr64 = lo;
goto odd64bit;
}
}
}
odd128bit:
if (u128_hi(curr128) >= OVER64) {
print("Overflow at root %" PRIu64 ".\n", root);
exit(EXIT_FAILURE);
}
curr128 += curr128 + curr128 + 1;
steps++;
if (max128 < curr128)
max128 = curr128;
goto any128bit;
done:
if (cached >= MAX_SEQ_LEN) {
print("Update cache overrun.\n");
exit(EXIT_FAILURE);
}
while (cached-->0)
seq_len[ cache_index[cached] ] = steps - cache_depth[cached];
if (max128 < (u128)max64)
max128 = max64;
if (maxto)
*maxto = max128;
if (tip_max <= max128) {
const unsigned int maxbits = highest_bit_set_128(max128) + 1;
tip_max = max128;
if (tip_bits <= maxbits) {
tip_bits = maxbits;
print("%" PRIu64 " length %u (reaches %s - %u bits).\n", root, steps, s128(max128), maxbits);
}
}
return steps;
}
int main(void)
{
unsigned int n, nmax = 0;
u128 m;
uint64_t i = 1;
wall_start();
/* If standard output is redirected to a file, print everything to standard error also. */
out_and_err = (isatty(STDERR_FILENO) && !isatty(STDOUT_FILENO));
/* Try allocating up to 16 GiB of cache. */
seq_num = (size_t)1024 * 1024 * 1024 * 16 / sizeof seq_len[0];
while (1) {
seq_len = malloc(seq_num * sizeof seq_len[0]);
if (seq_len)
break;
seq_num = ( seq_num * 7 ) / 8;
}
seq_max = 2 * (uint64_t)seq_num;
/* Touch the whole cache to force the kernel to back it with physical RAM, then zero it. */
memset(seq_len, ~0, seq_num * sizeof seq_len[0]);
memset(seq_len, 0, seq_num * sizeof seq_len[0]);
print("Allocated %zu entries (%.3f GiB)\n", seq_num, (double)(seq_num * sizeof seq_len[0]) / 1073741824.0);
do {
n = collatz_length(i, &m);
if (n >= nmax) {
const double cs = cpu_seconds();
const double ws = wall_seconds();
const char *s = s128(m);
nmax = n;
print("%" PRIu64 " length %u (reaches %s) [%.3f seconds elapsed, %.3f seconds CPU time]\n", i, n, s, ws, cs);
}
i++;
} while (nmax < MAX_SEQ_LEN);
return EXIT_SUCCESS;
}
I compile and run it using gcc -Wall -O2 collatz.c -lrt -o collatz && ./collatz > results.txt. (Optimization ensures, for example, that the sizeof if clauses are resolved at compile time, and generates a minimal number of slow conditional jumps, using conditional moves instead. So, the above program is designed to be compiled with -O2 at least.) It does contain extra code, for example to display the real time and CPU time used. While not related to the problem at hand, these details do help TAs easily detect if someone were to try to submit this code as their own homework.
While it is running, I can use sort -gk 3 results.txt in another terminal to see the results obtained thus far, sorted by increasing sequence length; or sort -gk 5 results.txt to see them sorted by the peak of the sequence (the largest value reached).