"What cache?" you may ask. Well most people do not know, but at read time the processor reads a cache line, not only a single value. The length of this cache varies, depending on a few factors (though most of the time it is 32 bytes). However, in the end, it is always a number that is a power of two. Take this into account when you read your arrays.
The same is true when your construct structures. To not force the Central Processing Unit to read a value from the memory twice, construct the structures so they will use an amount of memory that is a power of two. To visualize this problem, I have written a short program that will just present and measure the difference.
#include "stdio.h"
#include "stdlib.h"
// using Windows interface QueryPerformanceCounter( )
// to High Resolution CPU timer
// compute CPU time in microseconds of f():
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <winbase.h>
#define maxN 1000
#define maxM 1000
int t[maxN][maxM];
int f(void)
{
int i,j;
int s = 0;
for(i = 0; i < maxM; ++ i)
for(j = 0; j < maxN; ++ j)
s += t[j][i];
return s;
}
int main()
{
LARGE_INTEGER ticksPerSecond;
LARGE_INTEGER tick; // A point in time
LARGE_INTEGER start_ticks, end_ticks, cputime;
// get the high resolution counter's accuracy
QueryPerformanceFrequency(&ticksPerSecond);
// what time is it?
QueryPerformanceCounter(&tick);
double s = 0;
double temp;
for ( int x = 0; x< 10; ++x)
{
QueryPerformanceCounter(&start_ticks);
/* start f() */
for(int k = 0; k < 100; ++ k)
{
f();
}
/* end f( ) */
QueryPerformanceCounter(&end_ticks);
cputime.QuadPart = end_ticks.QuadPart- start_ticks.QuadPart;
printf ("tElapsed CPU time test: %.9f secn",
temp = ((float)cputime.QuadPart/(float)ticksPerSecond.QuadPart));
s += temp;
}
printf("ntAverage with global variables : %.9fn
Catch maintained", s/10);
return 0;
}
The code of interest is inside the f function. It is about the line: "s += t[j][i];". Here you can observe that we calculate the sum of the matrix by breaking the cache line after every read. When the CPU reads t[i][j] it will also read the t[i][j+1], t[i][j+2] and t[i][j+3]. However, if we calculate the sum this way at every read, it will have to reread the values, as the ones inside the catch are no longer usable. The time required to complete the process on my machine, with an E4300 Intel Dual Core at 1.8 GHz, is:
Now if we just switch up the two variables and write:
s += t[i][j];
The speed of the program now is:
That is double. Talk about a huge difference! In the future, remember that the CPU is still much faster than the memory, so avoid breaking the cache. Also, avoid using the global variables for local purpose; first, because this can lead to strange problems, and second, because this can slow down the application. Now this will have very little influence, but the influence is still there.
The local memory pool (stack) is much faster than the global one. This is limited to 4MB, though, so avoid exceeding this amount. In contests the limit is set at 1MB, so really, do not overuse it.
By now, you've probably heard all sorts of bad things about the loop encapsulated in C under the name of "for." Most teachers recommend that you avoid using for loops too much, and write as few of them as you can.
Now I am here to encourage you to take full advantage of its syntax to produce code that is at the same time readable and efficient. For this, let us repeat the syntax of the "for" loop.
for( commands that will be executed before entering the loop;
the loop will run until this statement is false;
commands that will be executed at the end of each loop);
{
commands that will be executed at each loop;
}
Now then, let us see two stylish usages of this loop. The first one is implementing the merge sort (has an average and worst-case performance of O(n log n) ) based on the idea of Mircea Pasoi.
int N, A[N], B[N];
void merge_sort(int l, int r)
{
int m = (l + r) >> 1; // divide by 2 via bitwise operator
int i, j, k;
if (l == r) return; // a single item
merge_sort(l, m); // sort on the left branch
merge_sort(m + 1, r);// sort on the right branch
for (i=l, j=m+1, k=l; i<=m || j<=r; )
if (j > r || (i <= m && A[i] < A[j]))
B[k++] = A[i++];
else
B[k++] = A[j++];
for (k = l; k <= r; k++) A[k] = B[k];
}
These solutions will not necessarily speed up your code; however, they are easy to implement and easy to follow two attributes that are just as important as being efficient at a challenge. Of course, at a contest you can forget the comments. I wrote them down now just for didactic purpose.
Here the A is the array that you want to sort and for this, we will use the B array. In the end, both of them will contain the sorted list of numbers in ascending order. On the following, I will present how to work with huge numbers without complicating things too much.
These methods are originally from Radu Berinde. The adequate way of dealing with these operations is to store the numbers inside an array with a reverse order of the reading (from the file) and the 0-th value of the array will point out how many digits compose the number. The subsequent methods are easy to implement, easy to understand, and efficient. What more could you desire?
Adding up two large numbers:
void add(int A[], int B[])
{
int i, t = 0;
for (i=1; i<=A[0] || i<=B[0] || t; i++, t/=10)
A[i] = (t += A[i] + B[i]) % 10;
A[0] = i - 1;
}
Multiplying one large number with a small one:
void mul(int A[], int B)
{
int i, t = 0;
for (i = 1; i <= A[0] || t; i++, t /= 10)
A[i] = (t += A[i] * B) % 10;
A[0] = i - 1;
}
Multiplying two large numbers:
void mul(int A[], int B[])
{
int i, j, t, C[NUMBER_OF_DIGITS];
memset(C, 0, sizeof(C));
for (i = 1; i <= A[0]; i++)
{
for (t=0, j=1; j <= B[0] || t; j++, t/=10)
C[i+j-1]=(t+=C[i+j-1]+A[i]*B[j])%10;
if (i + j - 2 > C[0]) C[0] = i + j - 2;
}
memcpy(A, C, sizeof(C));
}
The minus operator between two large numbers:
void sub(int A[], int B[])
{
int i, t = 0;
for (i = 1; i <= A[0]; i++)
A[i] += (t = (A[i] -= B[i] + t) < 0) * 10;
for (; A[0] > 1 && !A[A[0]]; A[0]--);
}
Dividing a large number with a small one:
void div(int A[], int B)
{
int i, t = 0;
for (i = A[0]; i > 0; i--, t %= B)
A[i] = (t = t * 10 + A[i]) / B;
for (; A[0] > 1 && !A[A[0]]; A[0]--);
}
The rest of a large number divided with a small one:
int mod(int A[], int B)
{
int i, t = 0;
for (i = A[0]; i > 0; i--)
t = (t * 10 + A[i]) % B;
return t;
}
Now we've already learned a few interesting methods, so to conclude this article (and this series) I will present a solution that uses the possibilities offered by the C+ world to throw calculating the factorial from the application to the compilation time. This is the original idea of Alexandru Mosoi. All we need is the constant for which to calculate the factorial. All of this will be done during compile time, and the compiler will insert that value for the version that will run on your system
#include <stdio.h>// for printing on the screen
template<int N> // we write a recursive template declaration
struct Factorial {
enum {
value = Factorial<N-1>::value * N
};
};
template<> // and write the specialization for the zero value
struct Factorial<0> {
enum { value = 1 };
};
int main(void)
{
int i = Factorial<4>::value;
char c[Factorial<5>::value];
printf("%d ",i);
printf("%d ",sizeof(c));
}
With the result of:
Remember that templates are a new addition to C++ with which you can create general classes and functions. They are generated only at compile time for the required value/type. Now that is what I call a smart solution. This will throw all the work to the compiler and maximize the readability of your code.
Continue your efforts to learn new smart solutions and you are on the right path to eventually succeed in programming competitions -- and moreover, be envied by the others. With this, I am closing this article series, therefore I would like to thank you for reading all of them and invite you to write down your judgment of them here on the blog, or join the friendly forums at DevHardware or DevArticles and leave me a note there. Live with Passion!