In the search of CPU free lunch; a DOD perspective

In the last post we showed how important data manipulation can be for the efficiency of programs. More specifically, we showed that, in some cases, accessing the data is much slower than executing the instructions on the CPU side. The last post was a showcase for data-oriented design (DOD).

But we can turn the problem upside down, and ask ourselves: assuming data layout is optimal, can we add more CPU instructions to be used while waiting for memory? The answer appears to be yes, within some limits.

This is not a showcase of data-oriented design; rather, it’s an exploration within the realm of DOD, investigating some of its finer points. And, compared to the last post, we venture outside the game industry, to emphasize that DOD techniques can be applied in other industries as well.

Are there free CPU cycles? A first estimation

The major conclusions of the last post were based on the fact that, on modern processors, data access is slower than the rate at which we can execute instructions on the CPU. This calls for the table with the numbers that every programmer should know:

Operation                               Time
1 cycle in a 2GHz processor             0.5 ns
L1 cache reference                      0.5 ns
Branch mispredict                       5 ns
L2 cache reference                      7 ns
Mutex lock/unlock                       25 ns
Main memory reference                   100 ns
Compress 1K bytes w/ cheap algorithm    3,000 ns
Send 2K bytes over 1 Gbps network       20,000 ns
Read 1 MB sequentially from memory      250,000 ns
Round trip within same datacenter       500,000 ns
Disk seek                               10,000,000 ns
Read 1 MB sequentially from disk        20,000,000 ns
Send packet CA->Netherlands->CA         150,000,000 ns

See the original article by Jeff Dean, and an up-to-date version of these numbers.

Typical CPU instructions take well below 10 ns, while a main memory access takes on the order of 100 ns. There is a gap here that we can potentially exploit. However, it may not be as simple as it sounds: we cannot completely separate memory accesses from the CPU instructions that consume them. Also, at this scale, it is hard to separate throughput from latency.
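
To put a rough number on this gap: at 0.5 ns per cycle, a single main-memory reference costs roughly

\[\frac{100\ \text{ns}}{0.5\ \text{ns/cycle}} = 200\ \text{cycles}\]

so, in principle, there are a couple of hundred cycles’ worth of work that could be done while waiting for one cache miss (ignoring, for the moment, pipelining, prefetching and out-of-order execution, which blur this picture considerably).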

Let us start measuring this.

A simple example

Consider a vector of integers with a lot of values (I’ve tested with 1,000,000 values). Let’s write a loop to increment all the values in this vector:

for (int i = 0; i < values.size(); i++)
    values[i]++;

From a DOD point of view, we are ok: we are using the most compact data structure for our problem, and we are not doing more work than we need.

But now, as a test, let’s increment only part of these values. Instead of incrementing i by 1, let’s add a variable step:

for (int i = 0; i < values.size(); i+=step)
    values[i]++;
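
As a side note, here is a minimal sketch of how such a measurement could be done with std::chrono (illustrative only; not necessarily the exact harness used to produce the numbers below):

#include <chrono>
#include <cstdio>
#include <vector>

// Time the strided increment loop for a given step, repeated many times.
double measureStep(std::vector<int>& values, int step, int repetitions = 1000) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < repetitions; r++)
        for (int i = 0; i < int(values.size()); i += step)
            values[i]++;
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    std::vector<int> values(1000000, 0);
    for (int step = 1; step <= 64; step *= 2)
        printf("step=%2d: %8.2f ms\n", step, measureStep(values, step));
    printf("%d\n", values[0]); // consume the result so the loops aren't optimized away
}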

Plotting the time needed to execute this (repeating this 1000 times) while varying the step yields the following chart:

We observe 3 main areas: a first part, in which the time drops significantly as the step increases; an almost-plateau phase, in which the time seems hardly affected by the step; and finally a phase in which the time decreases again with the step.

These three phases can be seen better if we also plot an “expected time” for each step; denoting by t(1) the time measured for step = 1, we compute this expected time as:

\[expectedTime(step) = \frac{t(1)}{step}\]

Plotting this alongside the measured time yields:

The plateau phase seems to be where step is between 4 and 16. On my machine, that corresponds to a stride between 16 and 64 bytes. Does it surprise you to learn that the cache line on my machine is 64 bytes? Once again, memory access seems to be the limiting factor here. It doesn’t quite matter whether we make 4 increments or 16 increments per cache line; we are mostly paying for the cache-line access.

Now that we know that 16 is a special step for our machine, let’s also plot the expected time computed starting from the measurement at step = 16:

\[expectedTime_{16}(step) = \frac{t(16) * 16}{step}\]

It seems that in the third stage the time decreases roughly proportionally with the step. That makes sense if we stop to think about it. Once the stride exceeds a cache line, we no longer touch every cache line of the vector; we update only one element per cache line that we do load, so increasing the step further reduces the number of cache lines loaded. For example, at step = 32 (a 128-byte stride) we touch only every other cache line, so we expect roughly half the time of step = 16.

From this analysis we can draw the following conclusion:

Conclusion so far

Within the boundary of a cache line, the number of integer increments we do does not significantly affect performance (beyond the initial 4 increments). We can do 16 increments in the same time we can do 4 increments.

Playing on the plateau

Let us play some more on the plateau part of our measurements (steps between 4 and 16). Let’s artificially add more CPU instructions. We will replace the increment with a multiplication, then with two, three, and four multiplications, and then we will also add a fastSqrt transformation. For each of our 5 new cases, the instruction inside the loop will look like the following:

// instead of values[i]++;
// case 1:
values[i] = values[i] * values[i];
// case 2:
values[i] = values[i] * values[i] * values[i];
// case 3:
values[i] = values[i] * values[i] * values[i] * values[i];
// case 4:
values[i] = values[i] * values[i] * values[i] * values[i] * values[i];
// case 5:
values[i] = fastSqrt(values[i]);

In the last case, instead of using the standard sqrtf we use a function that appears to be faster, at the expense of some precision. The reason for choosing this method is to prevent the compiler from transforming the instruction into an SSE-specific instruction and thus adding translation overhead. The function we are using is:

// Fast approximate square root: instead of calling sqrtf, it manipulates the
// bit pattern of the float (roughly halving the exponent). Needs <cstdint> for int32_t.
float fastSqrt(float x) {
    union {
        int32_t i;
        float x;
    } u;
    u.x = x;
    u.i = (1 << 29) + (u.i >> 1) - (1 << 22);
    return u.x;
}
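
As a quick sanity check (a sketch, assuming the fastSqrt definition above is in scope), we can compare it against the standard sqrtf:

#include <cmath>
#include <cstdio>

int main() {
    // Compare the approximation against sqrtf for a few values;
    // the result is only a rough approximation of the true square root.
    const float samples[] = {1.0f, 2.0f, 10.0f, 100.0f};
    for (float v : samples)
        printf("x = %6.1f   fastSqrt = %8.4f   sqrtf = %8.4f\n",
               v, fastSqrt(v), sqrtf(v));
}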

Measuring the performance for these 5 new cases yields the following results:

Just by looking at this graph, we conclude the following:

  • for low step values, there is no room for us to squeeze in more instructions
  • for higher step values (but still within a cache line), there is room for us to insert extra CPU instructions
  • the mul1 and mul2 cases match the original line pretty well

So, there is some slack on the instruction side, but that slack is very limited: just a few multiplications.

For the purpose of this exposition, we are not looking at the generated assembly, or at how the compiler behaves with this piece of code. However, please note that there is no clean separation between CPU instructions and data accesses: after all, we are also doing stores with the results of our injected instructions. In other words, it’s hard to increase the throughput of CPU instructions without also affecting the latency.

Conclusion so far

It is tough to increase the throughput of CPU instructions.

A real-world example

Let’s now turn to an example inspired by a real-life project. And this time, it’s inspired by the automotive industry, not the gaming industry.

Let’s assume that we have 2 components in our application: one that is responsible for management of some data, and one that uses data provided by the first component for some business processes.

Data manager  --data-->  Critical business logic

The business logic part needs to be very efficient. So the data that is passed to it needs to be in the right format for using it. For the purpose of our example, let’s assume the data is just an array of orientation vectors:

struct Vec3 {
    double x;
    double y;
    double z;
};

This is a 3D vector with a length of 1.

To make the problem slightly more interesting, let’s assume that Data manager needs to always concatenate two arrays before passing this to business logic. The process will look like:

void prepareData(const vector<Vec3>& s1, const vector<Vec3>& s2,
        vector<Vec3>& out) {
    int sz1 = int(s1.size());
    int sz2 = int(s2.size());
    int idxDst = 0;
    for (int i = 0; i < sz1; i++, idxDst++)
        out[idxDst] = s1[i];
    for (int i = 0; i < sz2; i++, idxDst++)
        out[idxDst] = s2[i];
}
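
Note that, as written, prepareData assumes the destination vector is already sized to hold both inputs. A call site would look something like this (the sizes here are made up for illustration):

// Hypothetical call site; `out` must be resized (not just reserved),
// because prepareData writes into it by index.
vector<Vec3> s1(100000), s2(50000);   // filled with data elsewhere
vector<Vec3> out(s1.size() + s2.size());
prepareData(s1, s2, out);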

Let’s take this as the baseline, and explore various alternatives starting with this process.

One may try to use memcpy here, and it would work faster than these manually written loops, but let’s exclude it from our repertoire (imagine that the real-world problem has certain twists that make memcpy unusable).

Doing a data analysis, there is nothing wrong with our data layout. We are using arrays and we are not wasting memory; both inputs and the output are properly compacted.

Please remember that the business logic is critical in terms of performance, so whatever we do, we cannot change the format of the output data. Thus, changing double to float for the output is not an option.

By the way we set up the problem, the only thing that we can do is to add more CPU instructions here. Let us explore some possibilities of adding useful stuff in this transformation. Let’s try to add some basic geometry transformations:

Vec3 negateXY(Vec3 v) { return Vec3{-v.x, -v.y, v.z}; }
Vec3 rotateZ30(Vec3 v) {
    constexpr float cos_theta = 0.866025f;
    constexpr float sin_theta = 0.5f;
    return Vec3{v.x * cos_theta - v.y * sin_theta, v.x * sin_theta + v.y * cos_theta, v.z};
}
void prepareData_neg(const vector<Vec3>& s1, const vector<Vec3>& s2,
        vector<Vec3>& out) {
    int sz1 = int(s1.size());
    int sz2 = int(s2.size());
    int idxDst = 0;
    for (int i = 0; i < sz1; i++, idxDst++)
        out[idxDst] = negateXY(s1[i]);
    for (int i = 0; i < sz2; i++, idxDst++)
        out[idxDst] = negateXY(s2[i]);
}
// similar for prepareData_rotate()

For both transformations we are doing a few (floating-point) arithmetic operations, and no extra memory operations.

Measuring the performance of these transformations with respect to our baseline, we get:

That is quite good. Not only did we add more operations to be performed while concatenating the data, but we also improved performance.

The reasons for the improved performance are accidental: the compiler has more hints to go through an SSE-based translation, and that performs better on my machine. This performance difference depends on the compiler and the machine. For the purpose of this post, we won’t go into details about this.

Conclusion

Sometimes, adding a small number of CPU instructions will make your program faster.

Let us build on that.

We said that, for performance reasons, we are not allowed to change the output data structure. But we can still change the input structure. The reason for such a change would be to better optimize the data manager component (this was actually the use case that inspired this post).

There are several assumptions that we can make to help in our problem:

  • precision is not that important; we can afford to lose a few percent of precision in the direction vectors
  • the Z component of the vector is always positive (we are always pointing upwards, instead of downwards)

With these assumptions, we can represent the vector into a much more compact structure:

struct Vec3Compact {
    float x;
    float y;

    Vec3Compact() {}
    Vec3Compact(float xx, float yy) : x{xx}, y{yy} {}
    Vec3Compact(Vec3 v) : x{(float)v.x}, y{(float)v.y} {}

    explicit operator Vec3() const {
        // reconstruct Z from the unit-length invariant; Z is assumed positive
        return Vec3{x, y, fastSqrt(1.0f - x * x - y * y)};
    }
};

The first thing to notice is that we dropped some precision by moving from double to float. The second thing is that we got rid of the Z component. We can do this because a direction vector is normalized, so that:

\[x^2+y^2+z^2 = 1\]

and thus:

\[z = \pm\sqrt{1 - x^2 - y^2}\]

One of our assumptions was that Z is always positive, so we can ignore the plus/minus variation. If we didn’t have this assumption, we would have to add an extra bit of information, plus some more CPU instructions for the translations between Vec3 and Vec3Compact.

In terms of memory utilization, this structure gives a 66% size reduction compared with the original one (8 bytes instead of 24).
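
As a quick sanity check of the trade-off, here is a small round-trip sketch (assuming the Vec3, Vec3Compact and fastSqrt definitions above are in scope):

#include <cstdio>

int main() {
    // Vec3 is 3 doubles = 24 bytes; Vec3Compact is 2 floats = 8 bytes.
    printf("sizeof(Vec3)=%zu, sizeof(Vec3Compact)=%zu\n",
           sizeof(Vec3), sizeof(Vec3Compact));

    Vec3 original{0.6, 0.0, 0.8};      // a unit vector with positive Z
    Vec3Compact compact(original);     // narrows x, y to float and drops Z
    Vec3 restored = (Vec3)compact;     // Z is re-derived, approximately, via fastSqrt
    printf("restored: %f %f %f\n", restored.x, restored.y, restored.z);
}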

With this, we can change the input arrays to be of Vec3Compact instead of Vec3, and the performance of our prepareData concatenation would be:

We achieve a further performance improvement. A simple concatenation of two Vec3Compact arrays, converting them into an array of Vec3 objects, is about 23% faster than the case in which we just concatenate two Vec3 arrays. Please note that in this process we reduced the data size of the input vectors, so, in consequence, we have fewer cache misses on the input vectors.

Also, as a side effect, the data manager component, having smaller data structures, will probably benefit from improved performance. For example, reading the data from disk is faster if we are reading 66% less data.

For reference, here is the code that does the concatenation of arrays of compact vectors:

void concat_compact(
        const vector<Vec3Compact>& s1, const vector<Vec3Compact>& s2,
        vector<Vec3>& out) {
    int sz1 = int(s1.size());
    int sz2 = int(s2.size());
    int idxDst = 0;
    for (int i = 0; i < sz1; i++, idxDst++)
        out[idxDst] = (Vec3)s1[i];
    for (int i = 0; i < sz2; i++, idxDst++)
        out[idxDst] = (Vec3)s2[i];
}

Adding the negate and rotate transformations on top of this compact method yields some interesting results:

The negate variant will perform slightly worse than the base compact version, but still better than the non-compact version. So, once again, we can piggyback more CPU instructions on a data traversal without affecting the performance too much.

The rotate version, however, produces results that are even worse than the original non-compact version. In this case, we are adding slightly more instructions. Trying to skip some of the computations yields better results, but still not extremely good ones – see the “c rotate-skip” bar. In this case, we only applied the transformation once every 8 vectors; that is, we apply the transformation twice per cache line.
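
For reference, a sketch of what such a rotate-skip variant might look like (the function name and the exact skip logic are illustrative; the idea is simply to apply the rotation only to one in every 8 elements):

// Sketch: convert every element, but apply the rotation only once every 8 vectors.
void concat_compact_rotate_skip(
        const vector<Vec3Compact>& s1, const vector<Vec3Compact>& s2,
        vector<Vec3>& out) {
    int sz1 = int(s1.size());
    int sz2 = int(s2.size());
    int idxDst = 0;
    for (int i = 0; i < sz1; i++, idxDst++)
        out[idxDst] = (i % 8 == 0) ? rotateZ30((Vec3)s1[i]) : (Vec3)s1[i];
    for (int i = 0; i < sz2; i++, idxDst++)
        out[idxDst] = (i % 8 == 0) ? rotateZ30((Vec3)s2[i]) : (Vec3)s2[i];
}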

Source code and measurements machine

As last time, all the code is available on GitHub.

The performance numbers shown here were obtained on my MacBook Pro Retina (late 2013), 2GHz Intel Core i7 (Haswell/Crystalwell), 8 GB RAM, macOS High Sierra. The cache line is 64 bytes, the L1 cache is 32 KB, L2 is 256 KB, and L3 is 6 MB.

Conclusions

This is an exploratory post. Instead of proving a point, it explores various aspects related to the existence of a CPU free lunch in the presence of compact data structures, as required by Data-Oriented Design. Even though there is no strong point that we can make, there are still several conclusions that we can draw from this exploratory work:

  • there is a CPU free lunch: we can add more CPU instructions without degrading performance
  • adding more CPU instructions can sometimes even improve performance, if it is coupled with a reduction in data size
  • the range in which we can add more CPU instructions without affecting performance is small
  • compiler optimizations play an important role in these types of operations, and can affect performance significantly

We’ve proven that there is a CPU free lunch, but it appears one cannot feast on it. What else is left to do? We shall further explore this technique for other applications.

Keep truthing!

About LucTeo

Lucian Radu Teodorescu holds a PhD in programming languages, and is a Software Architect at Garmin International. He likes challenges; and understanding the essence of things (if there is one) constitutes the biggest challenge of all.

Cluj-Napoca, Romania · lucteo.github.io
