Posts tagged C
Posts tagged C
I am currently in the process of rewriting a simple 2D game engine from C++ to C. Having always been a C++ programmer I am used to think in terms of classes. So I have done this in C as well. So here is an excerpt from how I originally wrote my header file for dealing with 2D vectors.
typedef struct _Vec
{
real x, y;
} Vec;
//- Constructors
Vec vec(real x, real y);
//- Calculations
Vec vecAdd(Vec u, Vec v);
Vec vecSub(Vec u, Vec v);
Vec vecMul(Vec v, real factor);
This is pretty much how a C++ guy would do it. Data definition and associated operations are put in the same place. Most other “classes” which takes 2D vectors as input would include this header file. However it gradually dawned upon me that this was completely unnecessary. Those other header files only needed to know the memory layout of a 2D vector. They didn’t need to know the associated functions. So I moved the struct definition out of the file and only included that in other header files which defined functions accepting 2D vectors. An example is my 2D matrix class. The excerpt below shows how it was originally. The header file included all the 2D vector functions (as a C++ class would).
#include "geometry/vec.h"
Mat matIdentity();
Mat matTranslate(Vec v);
Mat matMul(Mat m, Mat n);
Vec matMulVec(Mat m, Vec n);
But with my new realization geometry/vec.h was replaced by geometry/types.h which defined all the structs for geometry types like 2D vector, 2D matrix, circle, rectangle etc. geometry/vec.h was instead included in mat.c. This has major implications for large programs where compile times often can get very high because header files are re-processed countless times. I am not claiming this is any ingenious insight of mine. It is not by a long stretch. On the contrary it should be pretty darn obvious and a lot of C programmers are probably laughing at me for pointing out such an obvious thing, but having done C++ programming for too many years might have caused some brain damage. But hopefully this was worthwhile being pointed out to other brain damaged C++ developers.
Nor am I saying that this problem can’t be solved in C++, but it requires a lot more diligence. The Visualization Toolkit (VTK) is a good example. They mandate that you only use pointers to objects. That way you can forward declare everything, except the class you inherit. Most big C++ projects probably wont realize this until it is too late, and they have by then gotten too used to using nice STL and Boost stuff which doesn’t play well this mindset.
Inserting and retrieving elements in an STL vector is just as fast as doing it in a regular plain C array. This of course requires that you have turn on optimization.
However a STL vector has a number of attributes which potentially makes it slower than a C array:
A vector does not store its data in a continuous chunk of memory. Pointers to the start and end of the vector will be stored in a different location from the data itself.
An STL vector always allocates memory for its data on the heap. Heap allocation is slower than stack allocation.
To test how this would work in practice I created a program that would iterate over a large array, do some calculations on it and return small parts of it into either a plain C array or a STL vector.
for (int k = 0; k < rep; ++k) {
for (int i = 0; i < count; i += inc) {
double buffer[inc];
base->GetData(Range(i, i + inc - 1), buffer);
}
}
base is an object containing and array of data. We access parts of this array by specifying a range. The data in the range is returned in buffer. To test the STL vector we use this code in the inner loop instead>
vector<double> buffer(inc);
base->GetData(Range(i, i + inc - 1), buffer.begin());
A simple calculation is performed on each element
void Data::GetData(Range r, double *buffer) {
double *p = d + r.min;
double *q = d + r.max + 1;
while (p != q) {
double a = ((*p)+3) * (*p) + 4; // simple calc
*buffer++ = a;
++p;
}
}
The rep variable is used to perform everything several times over so that the cache will be warmed up. inc defines the size of the buffer used to store results from GetData. count is the total number of element. In all the tests we have kept count to a constant of 1 million double values. By varying the size inc of the buffer we get these results:
The scale is logarithmic along X and Y axis. At low buffer size we get an overhead in the virtual function call for the C array buffer. That is why larger buffers give better performance. However only buffers of 5-10 in size is needed for maximum performance. For STL vector we need buffers with size 1000 before we get comparable performance to plain C array. The reason for this is the overhead of the memory allocation and deallocation on the heap. This creates a very high overhead for small buffer sizes.
As the buffers get above 150 000 elements we see performance degrading again. This is likely because at that size the buffers can not fit into the cache. With smaller buffer sizes both the array and vector version will typically reuse the same memory already present on the stack. I tried printing out the address of the arrays allocated for STL vector and for small arrays we get the same address each time. This means that at each loop iteration we can reuse memory already in cache to write to.
Bottom line is:
At very small buffer sizes (1-5) overhead of function call (virtual) is noticeable.
At small buffers sizes (1 - 1000) we have overhead of memory allocation and deallocation for STL vector.
For large buffers we get overhead of cache misses
KCachegrind is a KDE program that reads profiling files produced by profiling tools. One of the tools it can read the results from is called Valgrind which lets you run your code in a simulator. The benefit of this is that you don’t have to modify your executable to profile it.
To create profiling information for KCachegrind to use we run Valgrind like this:
valgrind --tool=cachegrind ./app-to-profile
That will produce a file with a name like cachegrind.out.9686. In Gnome or KDE clickning these files will usually open them in KCachegrind.

On the left pane under Flat Profile you can see a list over all called methods and how much they cost. Cost is here an abstract term. Cost could be number of CPU cycles, number of instructions fetched, cache misses etc. The first colum has the name Self. This indicates the cost of calling that function minus the cost of the functions it calls. Thus cost is not aggregated. By clicking a function in this list you can watch the source code for that function in the right pane, annotated with the cost of each source code line.

The combo box in the upper right side, lets you chose what kind of cost you want to view in the function list and annoted next to the source code lines.
The right pane has many tabs. The Types tab lets you get an overview of the different event types that exist and their associated cost.

It is the same list of events as you find in the combo box. The event types of interest to us are:
To test the memory layouts affect on performance and cache misses I wrote two small programs, one in a data oriented style and another in a object oriented style. The object oriented version consisted of manipulation of struct defined as:
typedef struct {
double a, b, c, d, e, f, g, h, i, j;
} Attr;
This resembles OOP programming in where we bundle together all attributes related. Then arrays of these structs were created and initialized with random values:
Attr *a, *b;
a = newAttr(count);
b = newAttr(count);
randomizeAttr(a, a + count);
randomizeAttr(b, b + count);
And finally a simple calculation was done on each element.
for (i = 0; i < reps; ++i) {
for (j = 0; j < count; ++j) {
b[j].a = (a[j].a + b[j].a) * a[j].a + 4.0;
}
}
reps is used to repead the same calculation across the array several times over. This is to see what affect the cache might have on the performance.
In the data oriented version we use single structs containing arrays.
typedef struct {
double *a, *b, *c, *d, *e, *f, *g, *h, *i, *j;
} Attr;
Initialization is done in similar fashion
Attr a, b;
a = newAttr(count);
b = newAttr(count);
randomizeAttr(a, count);
randomizeAttr(b, count);
And we perform a similar simple calculation. But as you can see how the data is accessed is different.
for (i = 0; i < reps; ++i) {
for (j = 0; j < count; ++j) {
b.a[j] = (a.a[j] + b.a[j]) * a.a[j] + 4.0;
}
}
In both version the same calculation is performed over the same amount of data, however the performance is radically different.
Each logical unit of data consists of the 10 attributes a, b, c, d, e, f, g, h, i, j which are each double values which takes 8 bytes of memory each. In the OOP version each struct is thus 80 bytes large. If we have count be 200, meaning we have an array of 200 elements, then we have potentially roughly 15 KB that needs to fit into L1 cache. In our second test we set count to 1000 which means roughly 78 KB max needs to fit into cache.
In both tests we try with 1, 10 and 100 repetitions.
In the first test we can see the instruction fetch is 1 400 for our simple calculation with 1 repetition.

When we change repitions to 10 and 100, the instruction fetch changes to 14 000 and 140 000 respectively.

However the cache miss which is at 118 and stays as 118 even if we change number of repetitions to 10 or 100. Data Read access is 400, 4000 and 40 000 respectively. Which increases as expected with repetitions. This suggest that we manage to fit all data into cache so that we don’t get anymore cache misses on more repetitions.
The instruction fetch and data read access is the same for the data oriented version of the same program. Which is to be expected. The L1 data read miss is however substantially different. It is 25 and doesn’t change with repetitons either.
How can the data oriented version get almost 5 times lower cache miss?!
b[j].a = (a[j].a + b[j].a) * a[j].a + 4.0;
As you can see in the code line, calculations are only done on the a attribute. a attributes are continous in memory in the data oriented version. Which means they get pulled in together in the cache when a cache miss occurs. In the object oriented version 9 attributes b, c, d, e, f, g, h, i, j follow in memory. Part of these values are thus also pulled into cache when a cache hit occurs. How much depends on the size of the cache line.
This means that we get more cache misses because most of the extra data pulled in can’t be used in the next memory access. This also shows up in performance. When timing the running of each of these loops, the data oriented version ran about 4 times faster. The main reason for this is of course that memory access is what takes up most time. The calculations are super fast to perform and contributes little to the overal run time.
In the second test we tried with 1000 elements instead of 200. That is 5 times more. The instruction fetch is now 7000 on 1 repetition. Which means we are pulling 5 times as many instructions for the simple calculation as in the first test. That is as expected. For the data oriented version L1 data read miss stays at 255 for 1, 10 and 100 repetitions. This indicates that all the data fits into cache. For the object oriented version results are much worse. We get 2000, 20 000 and 200 000 respectively. This suggests that the cache is trashed. We can’t fit all the data into the cache and so it is flushed on each iteration.
The most likely reason is that the object oriented version requires storing a lot of data which is not used in the calculation. This makes the object oriented version potentially require up to 10 times as much space in cache.
Both of these programs were written in standard C, so you might wonder why I called one object oriented and one data oriented. Obviously there were no classes around. However the object oriented version did simulate how we typically end up laying out memory in an object oriented program. Related attributes are put close in memory. This can be a problem if only a small subset of these attributes are used together in individual calculations. The data oriented design doesn’t necessarily solve all cache problems. As I experimented I noticed that if I made the array large enough performance became equal again on both the data oriented and the object oriented version. This was likely caused by the array getting too large for the data oriented version as well and ending up trashing the cache.