Beyond the type system
Often when we consider a type system we think about a complex formalism to denote the semantics of a language. Languages typically add features by increasing the complexity of the type system.
But there is also another perspective on the type system: how does the generated code look like? what is the ABI for the code that uses different features of the type system? We’ll try to explore this perspective in this blog post.
Our perspective
For the purpose of this post, we assume a narrow perspective when analyzing the C++ type system. Assuming that we have a MyType
class type, we are interested in how the compiler passes this around in various contexts, how this type is adjusted based on the context need.
We care about the distinction between value type, reference type and pointer type, but we also look at r-value references and we briefly touch on C++’s value categories (glvalue, rvalue, lvalue, xvalue, prvalue).
On top of these, we are also briefly interested in the const
modifier (and we don’t care about the volatile
modifier for types).
From the all possible usages of types in C++, we will look at:
- defining variables
- fields in a structure/class
- function parameters
- function return value
For all these, we write C++ code and then we use a clang compiler to inspect the generated LLVM code. Of course, different compiler may have different rules, they generate different code. But this is not our main goal here; we don’t strive for completeness, but rather for some insights on how the compiler may do it.
Values, references, pointers
Let us take a trivial example:
If we compile this with clang, the LLVM output would be something similar to:
The compiler will not properly name the variables, but we can easily guess which is which. The variable %1
corresponds to the value type; the rest are variables %2
to %5
. What is immediately visible is that there is no difference between references and pointers; moreover, const
modifier doesn’t change the LLVM output. All these references and pointers get translated into simple pointers by LLVM. For each of this pointer, the initialization is translated in a store
instruction.
At this level, the constness of types is completely ignored. That’s why, from now on, we won’t be looking into const values, references and pointer anymore.
There is another important fact proved by this example. Look at the constructor call for the value. It takes the object to be constructed as a pointer. Actually, all the LLVM alloca
functions declare pointers instead of values. The type of %1
is %struct.MyType*
, and not %struct.MyType
; similarly, the type of %2
is %struct.MyType**
, and not %struct.MyType*
. In the case of references and pointers, clang actually transforms them into double pointers.
This is the first indicator that lvalues (all these C++ variables are lvalues) are translated into pointers.
lvalues, xvalues, prvalues
Let us take another example to show the difference between different value categories:
The resulting LLVM translation looks like:
Based on this, the following conclusions can be taken:
- lvalues and prvalues behave the same; they both result into a pointer created by the compiler
- xvalue requires both a pointer and a pointer-to-pointer
- std::move does an extra dereference compared to just a rvalue-reference creation
- xvalues (rvalue-references) behave like regular user-defied references
This way, we can arrange the value categories based on the underlying types in the following table:
Value category | Underlying type | Observation |
---|---|---|
lvalue | MyType* |
|
prvalue | MyType* |
|
xvalue | MyType** |
Just like references |
One point that keeps recurring through this post is that there is no pure value types in the ABI (exceptions apply).
Members of structures
Let’s look at the following C++ code using structures:
It defines a structure containing multiple values of different types. The point of this example is to check how different attributes of the structs are being used. The corresponding LLVM code looks like:
If we just look at the LLVM structure definition, things may be confusing. In there we have the first member (apparently) as pure value, and the rest of the members as pointers. But we need to look at how the structure is used to actually determine the proper values of the struct members.
The key point here is that whenever we need to access a struct member, we have two things: a handle to the structure itself, and the actual member in the structure (offset + type). That is, the final type of a struct member is always given by the combination of the structure access type and the underlying member type. For our example, we can summarize the results as:
Member type | Structure access | Underlying type | Combined type | Observation |
---|---|---|---|---|
MyType |
MyStruct* |
MyType |
MyType* |
Just like regular values |
MyType& |
MyStruct* |
MyType* |
MyType** |
Just like regular references |
MyType* |
MyStruct* |
MyType* |
MyType** |
Just like regular references |
MyType&& |
MyStruct* |
MyType* |
MyType** |
Just like regular references |
Indeed, if we look at the generated LLVM code, we see that each time we want to access a member (to call process
) we have either a MyType*
(first case), or a MyType**
(look at the extra load
instruction).
Again, pointers everywhere; no pure value types in underlying ABI.
Passing arguments to functions
Consider the following C++ functions:
These get translated into LLVM in the following way:
The function that just takes an int
as an argument is different from the rest of the other functions. It takes the parameter by value, but then it creates a temporary variable (i32*
) and copies the value of the argument into it. This will enable the users to dereference the parameter inside the function (&val
). Although we actually pass this by value, we always have a local variable that we can use.
All the rest of the functions take a MyType*
as parameter. Please note that byval
and dereferenceable
are only hints for the optimizer.
Inside of these functions, we use the parameter in different ways. That is why there is a difference in the body of the second function and the bodies of the last 3 functions. In the last 3 functions, we are actually creating new variables (MyType**
).
This can be summarized by the following table:
Parameter type | Underlying type | Local variable underlying type | Observation |
---|---|---|---|
int |
i32 |
i32* |
|
MyType |
MyType* |
N/A | No need for local var; just use param |
MyType& |
MyType* |
MyType** |
|
MyType* |
MyType* |
MyType** |
|
MyType&& |
MyType* |
MyType** |
Here, we need to discuss why do we have the int
parameter passed by value, but an MyType
parameter is actually passed with a pointer. In general, the compiler will pass a value through a pointer whenever one of the two conditions are met:
- object identity may be used in the definition of the object (i.e., object is not a POD)
- object size is larger than a given threshold.
If, for example, we would change the definition of MyType
to:
then we would get the following translation to LLVM of our pass-by-value function:
As one can see, the compiler will pass the value of MyType
as two 64-bit integers, and then recompose the object inside the function.
Of course, this works on a compiler in which an int
is translated to i32
. Other compilers may have different rules / thresholds.
In general, we can treat this as an optimization. We can assume that every pass-by-value will be translated into a pass-by-pointer in the underlying ABI, and sometimes the compiler can optimize on this.
If we ignore this optimization, we can conclude that all parameters are passed by pointers in the underlying ABI.
Returning from functions
Let’s now turn the attention towards expressing the return values in the ABI. Let’s look at the following C++ code:
This gets translated to:
Here, the translation rules are a little bit more interesting, but they still follow the same pattern. The results are summarized by the following table:
Return type | Actual return | Additional parameter | Observation |
---|---|---|---|
MyType |
void |
MyType* |
Return through an output param |
MyType& |
MyType* |
N/A | |
MyType* |
MyType* |
N/A | |
MyType&& |
MyType* |
N/A |
The function that returns a value is transformed into a function that returns void
and takes the location of where to write the resulting value as a pointer to the function – there is an implicit (sret
) parameter to the underlying function.
Again, ignoring the optimization for small POD types, the compiler will not use value-types for returning values from a function. There are pointers everywhere.
If we were to make a function that returns a smaller POD type (i.e., int
), then the compiler will actually generate a proper value return for it. But again, we can consider that as an optimization.
But why not values?
For non-trivial user-defined types, the compiler will not translate into values. Everything is a pointer down at generated code level. The language and the compiler just maintain the illusion of value types. Why can’t it just use values?
Besides reasons that involve performance, there is also a semantic reason for which a general type needs to be translated as pointer in all these contexts.
An object can store inside it (directly or indirectly) a pointer to itself. And there is no good way for the compiler to determine this.
Let me give an example, to clarify things. Let us consider the example of a double-linked list; we ignore genericity, and consider that the list stores simple int
s. The data structures would look like the following:
For implementing this List
, we have the following conventions:
- we keep a sentinel (node that is not actually part of the list)
- sentinel’s
next
will point to the first element in the list - sentinel’s
prev
will point to the last element in the list - we don’t want to make a heap allocation for the sentinel, so we store it directly in the
List
structure
There are actually many implementations that follow the same idea. So far so good.
Let us consider now how do we represent the empty list:
- it obviously has a sentinel (as it’s embedded in the
List
object) sentinel->next == &sentinel
– the absence of the first element is marked with the sentinelsentinel->prev == &sentinel
– the absence of the last element is marked with the sentinel
Now, because sentinel
is the only member of List
, the address of a List
object will be equal to the address of its sentinel. If the list is empty, the sentinel will point to itself (twice), and therefore will also point to the List
object.
Therefore, in our empty list example, all List
objects will contain two pointers to themselves. The same is true for non-empty lists, but there are multiple levels of indirections.
The compiler cannot figure out whether a List
object will point at itself or not without following the logic of how the List
objects are constructed and used.
In general, for an object that has a user-defined constructor, the compiler may assume that the object may actually be containing its own address. Therefore it cannot simply copy the value bits from one place to the other.
This is why, for most of the types, the compiler needs to maintain pointers everywhere when translating to machine code.
Brief conclusions and some unanswered questions
C++ has a complex type system. It distinguishes between values, references and pointers. Also from glvalues, rvalues, lvalues, xvalues, and prvalues. There are also const and non-const types, and also volatile and non-volatile types. All this complexity boils down to just a few options in the generated code.
The most important conclusion, is that the compiler generates almost everywhere pointer types for everything that the user writes. For non-trivial objects the compiler needs to guarantee that the object identity always remain in sync. In most cases, the compiler just provides the illusion of value semantics.
Considering all these, the following questions can be used to further explore this topic:
- is the C++ type system more complex than it needs to be? could it have been simpler?
- is the distinction between
T
andconst T&
really needed? Can’t the compiler just use the same semantics for both? (i.e., avoid copying, and simplifying syntax) - do we really need both references and pointers?
Keep truthing!
Comments