How dynamic languages efficiently handle data types
Dynamically typed languages provide near-complete freedom with the type of data we choose to store in a variable, pass to a function call, or return to a caller. While some languages still provide strong type guarantees at runtime (Python, Lua, etc.), others are the wild west of type systems (JavaScript).
Some questions that programmers often find themselves asking include: "What does a dynamic variable/value really look like under the hood?", "How does my language know how much space it needs to spare for a variable in memory if it can belong to any data type?". Hopefully, you'll click off this page having found some answers.
To follow the code samples in this post, you'll only need basic knowledge of C and familiarity with any duck-typed language like JavaScript, Lua, Ruby, PHP, or Python.
How an interpreted language operates at runtime
Nearly all interpreted programming languages have an intermediate representation called "bytecode". Bytecode is a set of very low-level instructions meant to command a virtual machine, commonly abbreviated to "VM". A single bytecode instruction usually translates to a set of simple tasks.
VMs come in two primary flavors: stack-based and register-based. A stack-based VM uses a stack as its primary data structure for storing and fetching values. A register-based VM uses 'registers', which are often just named indexes into an array. For the sake of efficiency, VMs are commonly written in low-level compiled languages, which tend to be statically typed (Rust, C, C++, Zig). It follows that our registers or stack slots must all share a single static data type, one capable of representing every data type in the interpreted language.
The crux of the problem boils down to this:
We need to define a static data type in our implementation language to store dynamic values for our interpreted language.
Every dynamic language has a fixed set of data types. For JavaScript, the data types are function, object, string, number, boolean, symbol, undefined, BigInt, and null. Every value that can possibly be represented in JS belongs to one of these categories. Arrays too are of type object (as can be seen by evaluating typeof [1, 2] === "object").
So the number of data types that we need to represent is a compile time constant. Let's consider an interpreter for a programming language that has number, boolean, object, and null (wherein object engulfs strings, arrays, hashtables, and other heap data structures).
With this newfound insight and using C as our implementation language, our problem can be restated as:
Create a data structure in C that can hold a heap object, a boolean, a null value, or a number.
Wait... isn't that just a struct? Not exactly, but it's a good place to start.
Bloated struct
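A first attempt, with `HeapObject` standing in for whatever representation our strings, arrays, and hashtables end up using:

```c
#include <stdbool.h>

// Stand-in for any heap-allocated data: strings, arrays, hashtables, etc.
typedef struct HeapObject HeapObject;

typedef struct {
    double      num;     // used when the value is a number
    bool        boolean; // used when the value is true or false
    HeapObject *object;  // used when the value lives on the heap
    // (null gets no field of its own - we'll handle it with a type tag shortly)
} Value;
```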
So far, so good. Now let us try to imagine how an interpreter would add two numbers represented as above. Consider a function that accepts two values and returns their sum:
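(A sketch - for now it blindly pretends both values hold numbers.)

```c
Value add_values(Value a, Value b) {
    Value result;
    result.num = a.num + b.num; // assumes both values are numbers
    return result;
}
```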
But what if a and b are not numbers? The semantics of some languages like JS dictate that the + operator can be used on values of any type. If a and b happen to be strings, we should concatenate them instead. Since dynamic values can change their type at runtime, the interpreter needs some way of knowing what data type a Value struct carries within it at any given time. For this purpose, we introduce a new enum that represents the data types in our language:
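(One possible shape - the exact names are illustrative.)

```c
typedef enum {
    TYPE_NUM,
    TYPE_BOOL,
    TYPE_NULL,
    TYPE_OBJECT,
} Type;

// The struct now carries its current type around with it.
typedef struct {
    Type        type;
    double      num;
    bool        boolean;
    HeapObject *object;
} Value;
```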
With that, we can change our add function to behave more sensibly:
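(A sketch; `concat_objects` and `type_error` are hypothetical helpers standing in for the interpreter's string and error machinery.)

```c
// Hypothetical helpers, assumed to exist elsewhere in the interpreter:
Value concat_objects(Value a, Value b);
Value type_error(const char *message);

Value add_values(Value a, Value b) {
    if (a.type == TYPE_NUM && b.type == TYPE_NUM) {
        Value result;
        result.type = TYPE_NUM;
        result.num  = a.num + b.num;
        return result;
    }
    if (a.type == TYPE_OBJECT && b.type == TYPE_OBJECT) {
        return concat_objects(a, b); // e.g. string concatenation
    }
    return type_error("cannot add these values");
}
```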
At this point, it should be clear that an interpreter can carry out all the operations it needs to with a data structure like this. But we can do better than a bloated struct.
Tagged union
The reason we provide our struct with multiple fields is that our values can change their data types at any moment. However, a Value can only carry one type of data at any given time because of our language's semantics. A value cannot be a number and a string simultaneously. To demonstrate: in JavaScript, after `let x = 1; x = "one";` runs, `x` holds only a string - the number it held a moment ago is gone.
This means when the object field of a Value struct carries meaningful data, all the other fields are doing no work and just eating up space! We need to store a double or a bool or a HeapObject*, but never more than one of those. Yet the space we're allocating for every Value is sizeof(double) + sizeof(bool) + sizeof(HeapObject*) + sizeof(Type) bytes big (plus whatever padding the compiler adds).
Since we only store one out of all possible data types, we only need to allocate enough space to store the largest. Meaning we can work with sizeof(Type) + max(sizeof(bool), sizeof(double), sizeof(HeapObject*) ) bytes of space instead. To make this optimization, we can use a union type like so:
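(A sketch; the tag stays outside the union.)

```c
typedef struct {
    Type type;
    union {
        double      num;
        bool        boolean;
        HeapObject *object;
    } as; // only one member is meaningful at any given time
} Value;
```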
That just saved us a handful of bytes! The implementation for the add_values function now calls for slight tweaks:
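(The payload now lives behind the `as` union.)

```c
Value add_values(Value a, Value b) {
    if (a.type == TYPE_NUM && b.type == TYPE_NUM) {
        Value result;
        result.type   = TYPE_NUM;
        result.as.num = a.as.num + b.as.num;
        return result;
    }
    // ...the remaining cases keep the same shape as before.
    return type_error("cannot add these values");
}
```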
Tagged unions are a value representation that many mainstream interpreters use today. Here is a shortened snippet of the value representation used by the Lua interpreter:
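(Paraphrased from Lua 5.3's lobject.h, shortened and lightly reformatted - consult the actual source for the full version.)

```c
typedef union Value {
  GCObject *gc;    /* collectable objects */
  void *p;         /* light userdata */
  int b;           /* booleans */
  lua_CFunction f; /* light C functions */
  lua_Integer i;   /* integer numbers */
  lua_Number n;    /* float numbers */
} Value;

#define TValuefields  Value value_; int tt_

typedef struct lua_TValue {
  TValuefields;
} TValue;
```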
You will find a similar value representation in the CPython interpreter as well.
Tagged unions are a good middle ground and a popular choice for interpreters. But if you're a hardcore speed freak, keep reading.
NaN boxing
16 bytes is sweet, but some smart compiler engineers argued we can do better. NaN boxing (ab)uses the IEEE-754 standard for floating points. The details are fairly nuanced, but what we need to know can be expressed neatly via this labeled bit layout (a plain-text sketch of the diagram from Wikipedia's IEEE-754 article):
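```
 bit 63  | bits 62..52 | bits 51..0
  sign   |  exponent   |  mantissa
 (1 bit) |  (11 bits)  |  (52 bits)
```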
A double precision float's bits are divided into 3 sections: a 1-bit sign, an 11-bit exponent, and a 52-bit mantissa. The exact interpretation of these bits isn't important for our purposes and can be found in the standard for the curious reader. What we need to know is the way to represent NaN - a value that means "not a number", generally emerging as the result of bad arithmetic operations like dividing zero by zero. The standard dictates that any 64-bit float with all of its exponent bits set to 1 (and a non-zero mantissa) is considered a NaN. If the highest mantissa bit (bit 51) of a NaN is set, it is a quiet NaN, which flows through arithmetic without raising exceptions, as opposed to a signaling NaN.
As long as the quiet and exponent bits are set, most CPUs identify the double as a NaN and do not care about the other bits in it. This means we can use the remaining bits to store any information we please while using regular doubles (that aren't NaNs) to store numeric values in our language.
Note: We use the terms "double" and "64-bit float" interchangeably in this post.
We first split the types of values we aim to represent into 3 convenient categories:
- Immediate/Singleton values: Values that are singletons and can be accessed without chasing pointers (null, true, false, etc.).
- Heap allocated data: Values that live on the heap, in a JS-like language this means functions, arrays, strings, and key-value pair objects.
- Numbers: A 64-bit floating point value, a regular number in our language.
With this classification, the rules for our new value representation can be laid down clearly:
- When a double is not a NaN, we use it as a regular number in our language.
- If a double is a NaN and its sign bit is 1, it is a pointer to a HeapObject.
- If a double is a NaN and its sign bit is 0, it is an immediate value whose type can be determined by examining its 3-bit tag (explained below).
Type tags
Now that numbers and objects can be identified solely by the NaN/sign bits of a double, we only need type tags (previously Type) for singleton values like null, true, and false. We reserve 3 bits of the mantissa (we'll use the lowest 3) and map each bit pattern to a type tag to achieve this.
- 000 -> NaN (an actual quiet nan produced in our language)
- 001 -> null
- 010 -> true
- 011 -> false
Should you need it, there is room for 4 more type tags. Remember that tags are only necessary when dealing with singleton values. For objects (whose sign bit is 1), the low bits belong to the pointer's address, so no tag is stored there.
Pointers
Out of 64 bits, 13 are reserved to identify our pointer as a quiet NaN with its sign bit set to 1 (1 sign bit + 11 exponent bits + 1 quiet bit). How are we going to represent a 64-bit pointer in the remaining 51 bits?
As it turns out, most 64-bit CPU architectures only use the lower 48 bits of a pointer to address memory, leaving the other bits untouched. That means the spare bits we're left with are all we're going to need on major architectures. A pointer can be encoded as 0xfff8XXXXXXXXXXXX, where the Xs are the 48 bits of the memory address and 0xfff8 covers the sign, exponent, and quiet-NaN bits.
Equipped with this knowledge, we can begin implementing a NaN boxed value representation. We will often need to inspect our 64-bit value as a raw sequence of bits and even more often as a double precision float. For this, we use a union again:
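(The field names here are our own choice.)

```c
#include <stdint.h>

typedef union {
    uint64_t bits; // the value as a raw bit pattern
    double   num;  // the same bits, viewed as a double
} Value;
```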
The union allows us to view the same bunch of bits under different lenses. An alternative trick is to use a cross-type pointer dereference (*(uint64_t*)(&double_value)) - though that one violates C's strict aliasing rules - or to memcpy the bits of one type into another and rely on the compiler to optimize the memcpy call away. For now, we stick to unions.
Then we define a bunch of handy bit masks which aid us in extracting out certain parts of a double:
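(One possible set - placing the tag in the lowest 3 bits is our choice, not a standard.)

```c
#define MASK_SIGN     0x8000000000000000ULL // bit 63
#define MASK_EXPONENT 0x7ff0000000000000ULL // bits 62..52
#define MASK_QUIET    0x0008000000000000ULL // bit 51, the quiet-NaN bit
#define MASK_TAG      0x0000000000000007ULL // bits 2..0, our 3-bit type tag
#define MASK_POINTER  0x0000ffffffffffffULL // bits 47..0, a 48-bit address

// All of these bits set at once means "this is not a real number".
#define MASK_NAN      (MASK_EXPONENT | MASK_QUIET)
```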
Now, given any 64-bit value, we are able to get its type tag using value.bits & MASK_TAG, and similarly all the other segments as well. Next, we map type names to their type tag values:
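(Matching the table from the previous section.)

```c
typedef enum {
    TAG_NAN   = 0, // 000: an actual quiet NaN produced by arithmetic
    TAG_NULL  = 1, // 001
    TAG_TRUE  = 2, // 010
    TAG_FALSE = 3, // 011
    // patterns 100 through 111 are free for future singletons
} Tag;
```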
Some accessors and tag-checking macros for convenience:
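(A sketch built on the masks above.)

```c
// A Value is a regular number unless all the NaN bits are set.
#define IS_NUMBER(v) (((v).bits & MASK_NAN) != MASK_NAN)

// Heap objects are quiet NaNs with the sign bit set...
#define IS_OBJECT(v) \
    (((v).bits & (MASK_NAN | MASK_SIGN)) == (MASK_NAN | MASK_SIGN))

// ...and singletons are quiet NaNs with the sign bit clear.
#define GET_TAG(v)  ((v).bits & MASK_TAG)
#define IS_NULL(v)  (!IS_NUMBER(v) && !IS_OBJECT(v) && GET_TAG(v) == TAG_NULL)
#define IS_TRUE(v)  (!IS_NUMBER(v) && !IS_OBJECT(v) && GET_TAG(v) == TAG_TRUE)
#define IS_FALSE(v) (!IS_NUMBER(v) && !IS_OBJECT(v) && GET_TAG(v) == TAG_FALSE)
```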
And finally some sentinel constants:
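(Using C99 compound literals.)

```c
#define NULL_VAL  ((Value){ .bits = MASK_NAN | TAG_NULL  })
#define TRUE_VAL  ((Value){ .bits = MASK_NAN | TAG_TRUE  })
#define FALSE_VAL ((Value){ .bits = MASK_NAN | TAG_FALSE })
#define NAN_VAL   ((Value){ .bits = MASK_NAN | TAG_NAN   })
```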
To facilitate creating nan-boxed values, we also define a bunch of helpers:
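(Reusing the `HeapObject` stand-in from earlier.)

```c
#include <stdint.h>

static inline Value number_val(double n) {
    Value v;
    v.num = n; // any non-NaN double is stored as-is
    return v;
}

static inline double as_number(Value v) {
    return v.num;
}

static inline Value object_val(HeapObject *obj) {
    Value v;
    v.bits = MASK_NAN | MASK_SIGN | (uint64_t)(uintptr_t)obj;
    return v;
}

static inline HeapObject *as_object(Value v) {
    return (HeapObject *)(uintptr_t)(v.bits & MASK_POINTER);
}
```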
Now, to demonstrate how our add_values function changes:
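(`concat_objects` and `type_error` are the same hypothetical helpers as before.)

```c
Value add_values(Value a, Value b) {
    if (IS_NUMBER(a) && IS_NUMBER(b)) {
        // No unboxing needed - the doubles are used directly.
        return number_val(as_number(a) + as_number(b));
    }
    if (IS_OBJECT(a) && IS_OBJECT(b)) {
        return concat_objects(a, b);
    }
    return type_error("cannot add these values");
}
```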
The full implementation of NaN tagging with tests can be found in this gist. The Wren programming language uses this same technique to encode its values, as can be seen from its source on GitHub.
Lua 5.2's interpreter can also use this trick on x86 (via its LUA_NANTRICK option), as is evident here.
Word of caution - not all CPU architectures guarantee that the 52 mantissa bits of a quiet NaN will be left untouched when producing and handling NaNs. So if you're looking to use this trick, keep a tagged union implementation around as a backup for platforms where NaN boxing isn't supported.
Pointer tagging
There exists yet another value representation, used both in hot and popular interpreters (most famously V8) and in the ancient VMs of Smalltalk and LISP: "pointer tagging".
Pointer tagging exploits the fact that heap allocations on 64-bit platforms are aligned to (at least) 8-byte boundaries. This means every memory address returned by the allocator is a multiple of 8, and in binary, every multiple of 8 has its 3 lowest bits set to 0. We can use these 3 spare bits to store our type tags for singleton values.
If we put our doubles on the heap, we can now represent all value types using only a pointer!
A notable difference from NaN boxing is that in this representation, 64-bit numbers no longer fit inline and are allocated on the heap instead. We use tags to differentiate between heap objects, numbers, and singleton values. Let's roll out a couple of macros to represent these tags:
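(A sketch; as before, the exact encoding is a design choice.)

```c
#include <stdint.h>

// The whole value is one pointer-sized word.
typedef uintptr_t Value;

#define TAG_OBJECT 0u // 000: pointer to a heap object, usable as-is
#define TAG_NUMBER 1u // 001: pointer to a heap-allocated double
#define TAG_NULL   2u // 010: no payload - the value is just the tag
#define TAG_TRUE   3u // 011
#define TAG_FALSE  4u // 100

#define GET_TAG(v)     ((v) & 7u)
#define GET_POINTER(v) ((void *)((v) & ~(uintptr_t)7u))

// Numbers no longer fit inline; they are boxed on the heap.
static inline Value number_val(double *boxed) {
    return (Value)(uintptr_t)boxed | TAG_NUMBER;
}

static inline double as_number(Value v) {
    return *(double *)GET_POINTER(v);
}
```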
Just like NaN tagging, pointer tagging is not universally portable - this time owing to its dependence on aligned memory addresses. If you understood NaN boxing well enough, you should be able to implement pointer tagging fairly easily. If you get stuck, this blog post by Max Bernstein serves as a solid reference.
What's next?
And with that, we've covered all the major value representations that compiler engineers have used over the ages. Next time a static typing crusader boasts of the compactness of their value representations, you know what to silence them with.
That said, it is important that we see the bigger picture. These aren't just clever tricks for making programming languages faster; they are data compaction techniques applicable to several facets of engineering.
So when you click off this page, you're leaving not with useless compiler-engineering circus tricks but with real-life optimization patterns.