How dynamic languages efficiently handle data types
Dynamically typed languages provide near-complete freedom with the type of data we choose to store in a variable, pass to a function call, or return to a caller. While some languages still provide strong type guarantees at runtime (Python, Lua, etc.), others are the wild west of type systems (JavaScript).
Some questions that programmers often find themselves asking include: "What does a dynamic variable/value really look like under the hood?", "How does my language know how much space it needs to spare for a variable in memory if it can belong to any data type?". Hopefully, you'll click off this page having found some answers.
To follow the code samples in this post, you'll only need basic knowledge of C and familiarity with any duck-typed language like JavaScript, Lua, Ruby, PHP, or Python.
How an interpreted language operates at runtime
Nearly all interpreted programming languages have an intermediate representation called "bytecode". Bytecode is a set of very low-level instructions meant to command a virtual machine, commonly abbreviated to "VM". A single bytecode instruction usually translates to a set of simple tasks.
VMs come in two primary flavors: stack-based and register-based. A stack-based VM uses a stack as its primary data structure for storing and fetching values. A register-based VM uses 'registers', which are often just named indexes into an array. For the sake of efficiency, VMs are commonly written in low-level compiled languages, which tend to be statically typed (Rust, C, C++, Zig). It follows that our registers or stack slots must all share a single static data type, one capable of representing every data type in the interpreted language.
The crux of the problem boils down to this:
We need to define a static data type in our implementation language to store dynamic values for our interpreted language.
Every dynamic language has a fixed set of data types. For JavaScript, the data types are function, object, string, number, boolean, symbol, undefined, BigInt, and null. Every value that can possibly be represented in JS belongs to one of these categories. Arrays too are of type object (as can be seen by evaluating typeof [1, 2] === "object").
So the number of data types that we need to represent is a compile time constant. Let's consider an interpreter for a programming language that has number, boolean, object, and null (wherein object engulfs strings, arrays, hashtables, and other heap data structures).
With this newfound insight and using C as our implementation language, our problem can be restated as:
Create a data structure in C that can hold a heap object, a boolean, a null value, or a number.
Wait... isn't that just a struct? Not exactly, but it's a good place to start.
Bloated struct
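A first attempt, with `HeapObject` standing in for whatever representation our strings, arrays, and hashtables end up using:

```c
#include <stdbool.h>

// Stand-in for any heap-allocated data: strings, arrays, hashtables, etc.
typedef struct HeapObject HeapObject;

typedef struct {
    double      num;     // used when the value is a number
    bool        boolean; // used when the value is true or false
    HeapObject *object;  // used when the value lives on the heap
    // (null gets no field of its own - we'll handle it with a type tag shortly)
} Value;
```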
So far, so good. Now let us try to imagine how an interpreter would add two numbers represented as above. Consider a function that accepts two values and returns their sum:
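(A sketch - for now it blindly pretends both values hold numbers.)

```c
Value add_values(Value a, Value b) {
    Value result;
    result.num = a.num + b.num; // assumes both values are numbers
    return result;
}
```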
But what if a and b are not numbers? The semantics of some languages like JS dictate that the + operator can be used on values of any type. If a and b happen to be strings, we should concatenate them instead. Since dynamic values can change their type at runtime, the interpreter needs some way of knowing what data type a Value struct carries within it at any given time. For this purpose, we introduce a new enum that represents the data types in our language:
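(One possible shape - the exact names are illustrative.)

```c
typedef enum {
    TYPE_NUM,
    TYPE_BOOL,
    TYPE_NULL,
    TYPE_OBJECT,
} Type;

// The struct now carries its current type around with it.
typedef struct {
    Type        type;
    double      num;
    bool        boolean;
    HeapObject *object;
} Value;
```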
With that, we can change our add function to behave more sensibly:
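(A sketch; `concat_objects` and `type_error` are hypothetical helpers standing in for the interpreter's string and error machinery.)

```c
// Hypothetical helpers, assumed to exist elsewhere in the interpreter:
Value concat_objects(Value a, Value b);
Value type_error(const char *message);

Value add_values(Value a, Value b) {
    if (a.type == TYPE_NUM && b.type == TYPE_NUM) {
        Value result;
        result.type = TYPE_NUM;
        result.num  = a.num + b.num;
        return result;
    }
    if (a.type == TYPE_OBJECT && b.type == TYPE_OBJECT) {
        return concat_objects(a, b); // e.g. string concatenation
    }
    return type_error("cannot add these values");
}
```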
At this point, it should be clear that an interpreter can carry out all the operations it needs to with a data structure like this. But we can do better than a bloated struct.
Tagged union
The reason we provide our struct with multiple fields is that our values can change their data types at any moment. However, a Value can only carry one type of data at any given time because of our language's semantics. A value cannot be a number and a string simultaneously. To demonstrate: in JavaScript, after `let x = 1; x = "one";` runs, `x` holds only a string - the number it held a moment ago is gone.
This means when the object field of a Value struct carries meaningful data, all the other fields are doing no work and just eating up space! We need to store a double or a bool or a HeapObject*, but never more than one of those. Yet the space we're allocating for every Value is sizeof(double) + sizeof(bool) + sizeof(HeapObject*) + sizeof(Type) bytes big (plus whatever padding the compiler adds).
Since we only store one out of all possible data types, we only need to allocate enough space to store the largest. Meaning we can work with sizeof(Type) + max(sizeof(bool), sizeof(double), sizeof(HeapObject*) ) bytes of space instead. To make this optimization, we can use a union type like so:
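(A sketch; the tag stays outside the union.)

```c
typedef struct {
    Type type;
    union {
        double      num;
        bool        boolean;
        HeapObject *object;
    } as; // only one member is meaningful at any given time
} Value;
```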
That just saved us a handful of bytes! The implementation for the add_values function now calls for slight tweaks:
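(The payload now lives behind the `as` union.)

```c
Value add_values(Value a, Value b) {
    if (a.type == TYPE_NUM && b.type == TYPE_NUM) {
        Value result;
        result.type   = TYPE_NUM;
        result.as.num = a.as.num + b.as.num;
        return result;
    }
    // ...the remaining cases keep the same shape as before.
    return type_error("cannot add these values");
}
```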
Tagged unions are a value representation that many mainstream interpreters use today. Here is a shortened snippet of the value representation used by the Lua interpreter:
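(Paraphrased from Lua 5.3's lobject.h, shortened and lightly reformatted - consult the actual source for the full version.)

```c
typedef union Value {
  GCObject *gc;    /* collectable objects */
  void *p;         /* light userdata */
  int b;           /* booleans */
  lua_CFunction f; /* light C functions */
  lua_Integer i;   /* integer numbers */
  lua_Number n;    /* float numbers */
} Value;

#define TValuefields  Value value_; int tt_

typedef struct lua_TValue {
  TValuefields;
} TValue;
```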
You will find a similar value representation in the CPython interpreter as well.
Tagged unions are a good middle ground and a popular choice for interpreters. But if you're a hardcore speed freak, keep reading.
NaN boxing
16 bytes is sweet, but some smart compiler engineers argued we can do better. NaN boxing (ab)uses the IEEE-754 standard for floating points. The details are fairly nuanced, but what we need to know can be expressed neatly via this labeled bit layout (a plain-text sketch of the diagram from Wikipedia's IEEE-754 article):
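```
 bit 63  | bits 62..52 | bits 51..0
  sign   |  exponent   |  mantissa
 (1 bit) |  (11 bits)  |  (52 bits)
```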
A double precision float's bits are divided into 3 sections: a 1-bit sign, an 11-bit exponent, and a 52-bit mantissa. The exact interpretation of these bits isn't important for our purposes and can be found in the standard for the curious reader. What we need to know is the way to represent NaN - a value that means "not a number", generally emerging as the result of bad arithmetic operations like dividing zero by zero. The standard dictates that any 64-bit float with all of its exponent bits set to 1 (and a non-zero mantissa) is considered a NaN. If the highest mantissa bit (bit 51) of a NaN is set, it is a quiet NaN, which flows through arithmetic without raising exceptions, as opposed to a signaling NaN.
As long as the quiet and exponent bits are set, most CPUs identify the double as a NaN and do not care about the other bits in it. This means we can use the remaining bits to store any information we please while using regular doubles (that aren't NaNs) to store numeric values in our language.
Note: We use the terms "double" and "64-bit float" interchangeably in this post.
We first split the types of values we aim to represent into 3 convenient categories:
- Immediate/Singleton values: Values that are singletons and can be accessed without chasing pointers (null, true, false, etc.).
- Heap allocated data: Values that live on the heap, in a JS-like language this means functions, arrays, strings, and key-value pair objects.
- Numbers: A 64-bit floating point value, a regular number in our language.
With this classification, the rules for our new value representation can be laid down clearly:
- When a double is not a NaN, we use it as a regular number in our language.
- If a double is a NaN and its sign bit is 1, it is a pointer to a HeapObject.
- If a double is a NaN and its sign bit is 0, it is an immediate value whose type can be determined by examining its 3-bit tag (explained below).
Type tags
Now that numbers and objects can be identified solely by the NaN/sign bits of a double, we only need type tags (previously Type) for singleton values like null, true, and false. We reserve 3 bits of the mantissa (we'll use the lowest 3) and map each bit pattern to a type tag to achieve this.
- 000 -> NaN (an actual quiet nan produced in our language)
- 001 -> null
- 010 -> true
- 011 -> false
Should you need it, there is room for 4 more type tags. Remember that tags are only necessary when dealing with singleton values. For objects (whose sign bit is 1), the low bits belong to the pointer's address, so no tag is stored there.
Pointers
Out of 64 bits, 13 are reserved to identify our pointer as a quiet NaN with its sign bit set to 1 (1 sign bit + 11 exponent bits + 1 quiet bit). How are we going to represent a 64-bit pointer in the remaining 51 bits?
As it turns out, most 64-bit CPU architectures only use the lower 48 bits of a pointer to address memory, leaving the other bits untouched. That means the spare bits we're left with are all we're going to need on major architectures. A pointer can be encoded as 0xfff8XXXXXXXXXXXX, where the Xs are the 48 bits of the memory address and 0xfff8 covers the sign, exponent, and quiet-NaN bits.
Equipped with this knowledge, we can begin implementing a NaN boxed value representation. We will often need to inspect our 64-bit value as a raw sequence of bits and even more often as a double precision float. For this, we use a union again:
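(The field names here are our own choice.)

```c
#include <stdint.h>

typedef union {
    uint64_t bits; // the value as a raw bit pattern
    double   num;  // the same bits, viewed as a double
} Value;
```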
The union allows us to view the same bunch of bits under different lenses. An alternative trick is to use a cross-type pointer dereference (*(uint64_t*)(&double_value)) - though that one violates C's strict aliasing rules - or to memcpy the bits of one type into another and rely on the compiler to optimize the memcpy call away. For now, we stick to unions.
Then we define a bunch of handy bit masks which aid us in extracting out certain parts of a double:
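(One possible set - placing the tag in the lowest 3 bits is our choice, not a standard.)

```c
#define MASK_SIGN     0x8000000000000000ULL // bit 63
#define MASK_EXPONENT 0x7ff0000000000000ULL // bits 62..52
#define MASK_QUIET    0x0008000000000000ULL // bit 51, the quiet-NaN bit
#define MASK_TAG      0x0000000000000007ULL // bits 2..0, our 3-bit type tag
#define MASK_POINTER  0x0000ffffffffffffULL // bits 47..0, a 48-bit address

// All of these bits set at once means "this is not a real number".
#define MASK_NAN      (MASK_EXPONENT | MASK_QUIET)
```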
Now, given any 64-bit value, we are able to get its type tag using value.bits & MASK_TAG, and similarly all the other segments as well. Next, we map type names to their type tag values:
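(Matching the table from the previous section.)

```c
typedef enum {
    TAG_NAN   = 0, // 000: an actual quiet NaN produced by arithmetic
    TAG_NULL  = 1, // 001
    TAG_TRUE  = 2, // 010
    TAG_FALSE = 3, // 011
    // patterns 100 through 111 are free for future singletons
} Tag;
```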
Some accessors and tag-checking macros for convenience:
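(A sketch built on the masks above.)

```c
// A Value is a regular number unless all the NaN bits are set.
#define IS_NUMBER(v) (((v).bits & MASK_NAN) != MASK_NAN)

// Heap objects are quiet NaNs with the sign bit set...
#define IS_OBJECT(v) \
    (((v).bits & (MASK_NAN | MASK_SIGN)) == (MASK_NAN | MASK_SIGN))

// ...and singletons are quiet NaNs with the sign bit clear.
#define GET_TAG(v)  ((v).bits & MASK_TAG)
#define IS_NULL(v)  (!IS_NUMBER(v) && !IS_OBJECT(v) && GET_TAG(v) == TAG_NULL)
#define IS_TRUE(v)  (!IS_NUMBER(v) && !IS_OBJECT(v) && GET_TAG(v) == TAG_TRUE)
#define IS_FALSE(v) (!IS_NUMBER(v) && !IS_OBJECT(v) && GET_TAG(v) == TAG_FALSE)
```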
And finally some sentinel constants:
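(Using C99 compound literals.)

```c
#define NULL_VAL  ((Value){ .bits = MASK_NAN | TAG_NULL  })
#define TRUE_VAL  ((Value){ .bits = MASK_NAN | TAG_TRUE  })
#define FALSE_VAL ((Value){ .bits = MASK_NAN | TAG_FALSE })
#define NAN_VAL   ((Value){ .bits = MASK_NAN | TAG_NAN   })
```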
To facilitate creating nan-boxed values, we also define a bunch of helpers:
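(Reusing the `HeapObject` stand-in from earlier.)

```c
#include <stdint.h>

static inline Value number_val(double n) {
    Value v;
    v.num = n; // any non-NaN double is stored as-is
    return v;
}

static inline double as_number(Value v) {
    return v.num;
}

static inline Value object_val(HeapObject *obj) {
    Value v;
    v.bits = MASK_NAN | MASK_SIGN | (uint64_t)(uintptr_t)obj;
    return v;
}

static inline HeapObject *as_object(Value v) {
    return (HeapObject *)(uintptr_t)(v.bits & MASK_POINTER);
}
```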
Now, to demonstrate how our add_values function changes:
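(`concat_objects` and `type_error` are the same hypothetical helpers as before.)

```c
Value add_values(Value a, Value b) {
    if (IS_NUMBER(a) && IS_NUMBER(b)) {
        // No unboxing needed - the doubles are used directly.
        return number_val(as_number(a) + as_number(b));
    }
    if (IS_OBJECT(a) && IS_OBJECT(b)) {
        return concat_objects(a, b);
    }
    return type_error("cannot add these values");
}
```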
The full implementation of NaN tagging with tests can be found in this gist. The Wren programming language uses this same technique to encode its values, as can be seen from its source on GitHub.
Lua 5.2's interpreter can also use this trick on x86 (via its LUA_NANTRICK option), as is evident here.
Word of caution - not all CPU architectures guarantee that the 52 mantissa bits of a quiet NaN will be left untouched when producing and handling NaNs. So if you're looking to use this trick, keep a tagged union implementation around as a backup for platforms where NaN boxing isn't supported.
Pointer tagging
There exists yet another value representation, used both in hot and popular interpreters (most famously V8) and in the ancient VMs of Smalltalk and LISP: "pointer tagging".
Pointer tagging exploits the fact that heap allocations on 64-bit platforms are aligned to (at least) 8-byte boundaries. This means every memory address returned by the allocator is a multiple of 8, and in binary, every multiple of 8 has its 3 lowest bits set to 0. We can use these 3 spare bits to store our type tags for singleton values.
If we put our doubles on the heap, we can now represent all value types using only a pointer!
A notable difference from NaN boxing is that in this representation, 64-bit numbers no longer fit inline and are allocated on the heap instead. We use tags to differentiate between heap objects, numbers, and singleton values. Let's roll out a couple of macros to represent these tags:
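(A sketch; as before, the exact encoding is a design choice.)

```c
#include <stdint.h>

// The whole value is one pointer-sized word.
typedef uintptr_t Value;

#define TAG_OBJECT 0u // 000: pointer to a heap object, usable as-is
#define TAG_NUMBER 1u // 001: pointer to a heap-allocated double
#define TAG_NULL   2u // 010: no payload - the value is just the tag
#define TAG_TRUE   3u // 011
#define TAG_FALSE  4u // 100

#define GET_TAG(v)     ((v) & 7u)
#define GET_POINTER(v) ((void *)((v) & ~(uintptr_t)7u))

// Numbers no longer fit inline; they are boxed on the heap.
static inline Value number_val(double *boxed) {
    return (Value)(uintptr_t)boxed | TAG_NUMBER;
}

static inline double as_number(Value v) {
    return *(double *)GET_POINTER(v);
}
```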
Just like NaN tagging, pointer tagging is not universally portable - this time owing to its dependence on aligned memory addresses. If you understood NaN boxing well enough, you should be able to implement pointer tagging fairly easily. If you get stuck, this blog post by Max Bernstein serves as a solid reference.
What's next?
And with that, we've covered all the major value representations that compiler engineers have used over the ages. Next time a static typing crusader boasts of the compactness of their value representations, you know what to silence them with.
That said, it is important that we see the bigger picture. These aren't just clever tricks for making programming languages faster; they are data compaction techniques applicable to several facets of engineering.
So when you click off this page, you're leaving not with useless compiler-engineering circus tricks but with real-life optimization patterns.