Performance Issue of Memory Alignment

Last updated on July 26, 2023 pm

Performance Issue of Memory Alignment

We have already discussed some tricks for aligning numbers. In this post, we will explore the difference in code performance between aligned and unaligned memory access, especially when accessing and assigning to the members of struct (and possibly class). A piece of ready-to-go C++ code is provided in the appendix.

Design

We plan to use struct to demonstrate the performance difference. Basically, it would contain three members: char, int and double. We will try to align them in different ways and see the performance difference.

1
2
3
4
5
6
struct Struct1
{
volatile char my_char = '\0';
volatile int my_int = 0;
volatile double my_double = 0.0;
}

The size of each type is as follows, but the size of Struct1, which is presumably the sum of the size of its members, is not guaranteed.

Data Type Size in Bytes
char 1
int 4
double 8
Struct1 16

If we run with the code sizeof(Struct1), the result might actually be 16, because the compiler does some padding to make the size of Struct1 a multiple of 4, trying to make it not too inefficient. And we won’t touch this part for simplicity’s sake.

By permutation without repetition, we have 6 ways to align the members of our struct, namely Struct1 to Struct6 (see the appendix test.hh).

Implementation

The idea is simple:

  1. Initialize an object;
  2. Assign values to its members;
  3. Loop for many many times;
  4. Record the total time consumed of one type.
  5. Compare the time across different types.

So the basic code looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
#define ITER 10000000
int main() {
// get start time
auto start = std::chrono::high_resolution_clock::now();

// start looping
for (int i = 0; i < ITER; ++i) {
Struct1 data;
data.c = 'A';
data.i = 42;
data.d = 3.14;
}

// get finish time
auto end = std::chrono::high_resolution_clock::now();
// calculate time duration in microseconds
auto duration =
std::chrono::duration_cast<std::chrono::microseconds>(end - start)
.count();

std::cout
<< "Access time of Struct1 with proper alignment: "
<< duration << " microseconds" << std::endl;
}

Just replace Struct1 with Struct2Struct6 and we can get the time consumed by each type.

But some changes are needed to make the code more readable and convenient to collect data.

  1. Implemented a template with generic type to avoid code duplication.
  2. Used a vector time to store the time consumed by each type.
  3. For each value of iter, we run the test for 5 times, or, 5 epochs.
  4. Used two arrays min_arr and max_arr to keep track of which type contributes to the max and min consumed time.
  5. Save the result as a csv file for further analysis.
  6. Prompt out current iteration and epoch so that we can know the progress.

For example, say after one epoch, when we get the time vector with values 294146, 276858, 265321, 267835, 278274, 282499, we can know that Struct3 is the fastest and Struct1 is the slowest. Then we can increment min_arr[2] and max_arr[0] (indies start at 0) by 1, this is done by the count() function.

Note also that the order in which the members are accessed counts. For example, if we access char first, then int, then double, this is different from accessing double first, then int, then char. This will also affect the performance. Here, we simply access them in the reverse order of their declaration.

Implementation details can be found in the appendix test.cpp.

Result

We calculated iter from 100000000 to 100001000 with step 10, and epoch from 1 to 5. This means that we have a total of 500 rows of data. So the result is quite accurate and convincing from statistical perspective.

The final min_arr and max_arr are as follows:

Struct1 Struct2 Struct3 Struct4 Struct5 Struct6
MAX Count 288 209 2 4 0 2
MIN Count 0 0 50 67 331 57

We can see that in general Struct5 is the fastest and Struct1 is the slowest.

Visualization

And they are quite dominant over other types.

The full results can be found here.

Analysis

This result is quite self-explanatory. If we check the size of these structs, they are not all 16 bytes.

Data Type Size in Bytes
Struct1 16
Struct2 24
Struct3 16
Struct4 24
Struct5 16
Struct6 16

We can see that Struct2 and Struct4 are 24 bytes, which is not a multiple of 8. The compilers do padding and alignment based on 8 bytes for each structure member. GCC/g++ also has an alignment attribute that can be used to control the alignment of structure members. For example, we can use __attribute__((aligned(8))) to ensure that the structure is aligned to 8 bytes when using GCC/g++.

References

Appendix

Environment and Compilation

  • Compiler: MinGW32 GCC 9.2.0
  • OS: Windows 10
  • Arch: X86_64

Build with command g++ -O0 test.cpp -o test.

test.hh

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
#ifndef __TEST_HH__
#define __TEST_HH__

#define ITER_MAX 100001000
#define ITER_MIN 100000000

struct Struct1
{
volatile char my_char = '\0';
volatile int my_int = 0;
volatile double my_double = 0.0;
public:
void assign()
{
my_char = 'a';
my_int = 42;
my_double = 3.14;
}
};

struct Struct2
{
volatile char my_char = '\0';
volatile double my_double = 0.0;
volatile int my_int = 0;
public:
void assign()
{
my_char = 'a';
my_double = 3.14;
my_int = 42;
}
};

struct Struct3
{
volatile int my_int = 0;
volatile char my_char = '\0';
volatile double my_double = 0.0;
public:
void assign()
{
my_int = 42;
my_char = 'a';
my_double = 3.14;
}
};

struct Struct4
{
volatile int my_int = 0;
volatile double my_double = 0.0;
volatile char my_char = '\0';
public:
void assign()
{
my_int = 42;
my_double = 3.14;
my_char = 'a';
}
};

struct Struct5
{
volatile double my_double = 0.0;
volatile char my_char = '\0';
volatile int my_int = 0;
public:
void assign()
{
my_double = 3.14;
my_char = 'a';
my_int = 42;
}
};

struct Struct6
{
volatile double my_double = 0.0;
volatile int my_int = 0;
volatile char my_char = '\0';
public:
void assign()
{
my_double = 3.14;
my_int = 42;
my_char = 'a';
}
};

#endif // __TEST_HH__

test.cpp

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
#include <iostream>
#include <chrono>
#include <vector>
#include <algorithm>
#include <fstream>

#include "test.hh"

int min_arr[6];
int max_arr[6];

template <typename StructType>
long long runTest(long long iter)
{
auto start = std::chrono::high_resolution_clock::now();

for (long long i = 0; i < iter; i++)
{
// init object and do value assignment `iter` times
StructType data;
data.assign();
}

auto finish = std::chrono::high_resolution_clock::now();
auto duration =
std::chrono::duration_cast<std::chrono::microseconds>(finish - start)
.count();

return duration;
}

void print_min_max(){

std::cout << "MAX [";
for (int i = 0; i < 6; i++)
std::cout << max_arr[i] << ", ";
std::cout << "]" << std::endl;

std::cout << "MIN [";
for (int i = 0; i < 6; i++)
std::cout << min_arr[i] << ", ";
std::cout << "]" << std::endl;
}

void count(std::vector<long long> &time_vec)
{
// keep track of the max and min
auto max = std::max_element(time_vec.begin(), time_vec.end()) - time_vec.begin();
max_arr[max] += 1;
auto min = std::min_element(time_vec.begin(), time_vec.end()) - time_vec.begin();
min_arr[min] += 1;
return;
}

int main()
{
std::vector<long long> time;

// write timing result to csv file
std::ofstream output_file;
output_file.open("output.csv");
output_file // print csv headers
<< "iteration,epoch," << "struct1," << "struct2,"
<< "struct3," << "struct4," << "struct5,"
<< "struct6," << std::endl;

for (long long iter = ITER_MIN; iter <= ITER_MAX; iter += 10)
{
std::cout << "Iteration time: " << iter << std::endl;
for (int epoch = 1; epoch < 6; epoch++)
{
std::cout << "Epoch: " << epoch << std::endl;
time.push_back(runTest<Struct1>(iter));
time.push_back(runTest<Struct2>(iter));
time.push_back(runTest<Struct3>(iter));
time.push_back(runTest<Struct4>(iter));
time.push_back(runTest<Struct5>(iter));
time.push_back(runTest<Struct6>(iter));
output_file << iter << "," << epoch << ",";
for (int index = 0; index < time.size(); index++)
output_file << time[index] << ",";
output_file << std::endl;
count(time);
time.clear();
}
print_min_max();
}

print_min_max();
output_file.close();

return 0;
}

plot.py

Also include a Python script to plot the result, as a backup.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6]
y_max = [-288, -209, -2, -4, 0, -2] # slowest
y_min = [0, 0, 50, 67, 331, 57] # fastest

fig, ax = plt.subplots(facecolor="#eaeaf2",)

fig.set_figheight(6)
fig.set_figwidth(6)

ax.set_xlabel("Structs")
ax.set_ylabel("Times of Max/Min")
ax.set_title("Max/Min Counts over Experiment", fontsize=18, pad=15)

bar_max = ax.bar(x, y_max, 0.35, label="slowest", color="#97d6fc") # green
bar_min = ax.bar(x, y_min, 0.35, label="fastest", color="#ff8282") # red
ax.axhline(0, color="black", linewidth=0.5)
ax.legend()

# add value labels above each bar
for bar in bar_min:
height = bar.get_height()
ax.text(
bar.get_x() + bar.get_width() / 2.0,
height,
f"{int(height)}",
ha="center",
va="bottom",
fontsize=9,
)

# add value labels below each bar
for bar in bar_max:
height = bar.get_height()
ax.text(
bar.get_x() + bar.get_width() / 2.0,
height-25,
f"{-int(height)}",
ha="center",
va="bottom",
fontsize=9,
)

plt.xticks(rotation=45)
plt.tight_layout()
plt.subplots_adjust(wspace=0, top=0.85, bottom=0.12, left=0.18, right=0.95)
# plt.show()
plt.savefig("align-test.png", facecolor="#eee", dpi=500)

Performance Issue of Memory Alignment
https://lingkang.dev/2023/07/27/align-test/
Author
Lingkang
Posted on
July 26, 2023
Licensed under