c++c++11 types programming-languages valgrind

C++ and dynamically typed languages

Today I talked to a friend about the differences between statically and dynamically typed languages (more info about the difference between static and dynamic typed languages in this SO question). After that, I was wondering what kind of trick can be used in C++ to emulate such dynamic behavior.

In C++, as in other statically typed languages, the variable type is specified at compile time. For example, let's say I have to read from a file a big amount of numbers, which are in the majority of the cases quite small, small enough to fit in an unsigned short type. Here comes the tricky thing, a small amount of these values are much bigger, bigger enough to need an unsigned long long to be stored.

Since I assume I'm going to do calculations with all of them I want all of them stored in the same container in consecutive positions of memory in the same order than I read them from the input file.. The naive approach would be to store them in a vector of type unsigned long long, but this means having typically up to 4 times extra space of what is actually needed (unsigned short 2 bytes, unsigned long long 8 bytes).

In dynamically typed languages, the type of a variable is interpreted at runtime and coerced to a type where it fits. How can I achieve something similar in C++?

My first idea is to do that by pointers, depending on its size I will store the number with the appropriate type. This has the obvious drawback of having to also store the pointer, but since I assume I'm going to store them in the heap anyway, I don't think it matters.

I'm totally sure that many of you can give me way better solutions than this ...

#include <iostream>
#include <vector>
#include <limits>
#include <sstream>
#include <fstream>

int main() {
    std::ifstream f ("input_file");
    if (f.is_open()) {
        std::vector<void*> v;
        unsigned long long int num;
        while(f >> num) {
            if (num > std::numeric_limits<unsigned short>::max()) {
                v.push_back(new unsigned long long int(num));
            }
            else {
                v.push_back(new unsigned short(num));
            }
        }
        for (auto i: v) {
            delete i;
        }
    f.close();
    }
}

Edit 1: The question is not about saving memory, I know in dynamically typed languages the necessary space to store the numbers in the example is going to be way more than in C++, but the question is not about that, it's about emulating a dynamically typed language with some c++ mechanism.

Solution

Options include...

Discriminated union

The code specifies a set of distinct, supported types T0, T1, T2, T3..., and - conceptually - creates a management type to

struct X
{
    enum { F0, F1, F2, F3... } type_;
    union { T0 t0_; T1 t1_; T2 t2_; T3 t3_; ... };
};

Because there are limitations on the types that can be placed into unions, and if they're bypassed using placement-new care needs to be taken to ensure adequate alignment and correct destructor invocation, a generalised implementation becomes more complicated, and it's normally better to use boost::variant<>. Note that the type_ field requires some space, the union will be at least as large as the largest of sizeof t0_, sizeof t1_..., and padding may be required.

std::type_info

It's also possible to have a templated constructor and assignment operator that call typeid and record the std::type_info, allowing future operations like "recover-the-value-if-it's-of-a-specific-type". The easiest way to pick up this behaviour is to use boost::any.

Run-time polymorphism

You can create a base type with virtual destructor and whatever functions you need (e.g. virtual void output(std::ostream&)), then derive a class for each of short and long long. Store pointers to the base class.

Custom solutions

In your particular scenario, you've only got a few large numbers: you could do something like reserve one of the short values to be a sentinel indicating that the actual value at this position can be recreated by bitwise shifting and ORing of the following 4 values. For example...

10 299 32767 0 0 192 3929 38

...could encode:

10
299
// 32767 is a sentinel indicating next 4 values encode long long
(0 << 48) + (0 << 32) + (192 << 16) + 3929
38

The concept here is similar to UTF-8 encoding for international character sets. This will be very space efficient, but it suits forward iteration, not random access indexing a la [123].