Dot product on misaligned data

One of my most popular blog posts of all time is Data alignment for speed: myth or reality? According to my dashboard, hundreds of people a week still load that old post. A few times a year, I get an email from someone who disagrees with it.

The blog post makes a simple point. Programmers are often told to worry about 'unaligned loads' for performance. My point is that you should generally not worry about alignment when optimizing your code.

An unaligned load occurs when a processor attempts to read data from memory at an address that is not properly aligned. Most computer architectures require data to be accessed at addresses that are multiples of the data's size (e.g., 4-byte data should be accessed at addresses divisible by 4). For example, a 4-byte integer or float should be loaded from an address like 0x1000 or 0x1004 (aligned), but if the load is attempted from 0x1001 (not divisible by 4), it is unaligned. Under some conditions, an unaligned load can crash your system, and it generally leads to undefined behavior in C or C++.

Related to this alignment issue is that data is typically organized in cache lines (64 bytes or 128 bytes on most systems) that are loaded together. If you load data randomly from memory, you might touch two cache lines, which could cause an additional cache miss. If you need to load data spanning two cache lines, there might be a penalty (say one cycle) as the processor needs to access the two cache lines and reassemble the data. Further, there is also the concept of a page of memory (4 kB or more). Accessing an additional page can be costly, and you typically want to avoid touching more pages than you need to. However, you have to be somewhat unlucky to frequently cross two pages with one load operation.

How can you end up with unaligned loads? It often happens when you access low-level data structures, mapping values onto specific bytes. For example, you might load a binary file from disk whose format says that all the bytes after the first one are 32-bit integers. Without copying the data, it could be difficult to align it. You might also be packing data: imagine that you have a pair of values, one that fits in a byte and another that requires 4 bytes. You could pack these values into 5 bytes instead of 8.

There are cases where you should worry about alignment: if you are crafting your own memory-copy function, if you want to be standards compliant (in C/C++), or if you need atomic operations (for multithreaded code). You might also encounter 4K aliasing, an issue Intel describes where arrays stored in memory at locations that are nearly a multiple of 4KB can mislead the processor into thinking data is being written and then immediately read.

However, my general point is that it is unlikely to be a performance concern.

I decided to run a new test, given that I haven't revisited this problem since 2012. Back then, I used a hash function. This time, I use SIMD-based dot products with either ARM NEON intrinsics or AVX2 intrinsics. I build two large arrays of 32-bit floats and compute their scalar product: I multiply the elements pairwise and sum the products. The arrays fit within a megabyte so that we are not RAM limited.

I run benchmarks on an Apple M4 processor as well as on an Intel Ice Lake processor.

On the Apple M4, we can barely see the alignment effect (about 10%).

| Byte offset | ns/float | ins/float | instructions/cycle |
|------------:|---------:|----------:|-------------------:|
| 0 | 0.059 | 0.89 | 2.93 |
| 1 | 0.064 | 0.89 | 2.75 |
| 2 | 0.062 | 0.89 | 2.82 |
| 3 | 0.064 | 0.89 | 2.69 |
| 4 | 0.062 | 0.89 | 2.84 |
| 5 | 0.064 | 0.89 | 2.75 |
| 6 | 0.062 | 0.89 | 2.69 |
| 7 | 0.064 | 0.89 | 2.75 |

And we cannot see much of an effect on the Intel Ice Lake processor.

| Byte offset | ns/float | ins/float | instructions/cycle |
|------------:|---------:|----------:|-------------------:|
| 0 | 0.086 | 0.38 | 1.36 |
| 1 | 0.087 | 0.38 | 1.36 |
| 2 | 0.087 | 0.38 | 1.36 |
| 3 | 0.087 | 0.38 | 1.36 |
| 4 | 0.086 | 0.38 | 1.36 |
| 5 | 0.087 | 0.38 | 1.36 |
| 6 | 0.086 | 0.38 | 1.36 |
| 7 | 0.086 | 0.38 | 1.36 |

Using the 512-bit registers from AVX-512 does not change the conclusion.

My point is not that you cannot somehow detect the performance difference due to alignment in some tests. My point is that it is simply not something that you should generally worry about as far as performance goes.

My source code is available.

Daniel Lemire, "Dot product on misaligned data," in Daniel Lemire's blog, July 14, 2025, https://lemire.me/blog/2025/07/14/dot-product-on-misaligned-data/.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

4 thoughts on “Dot product on misaligned data”

  1. Unaligned reads are generally fast on modern processors, even if they cross cache lines.

    I did notice some issues with your benchmark:

    * your multiply-accumulate loop only uses a single accumulator. Optimal code would use multiple accumulators
    * a test size of 100k 32-bit floats would consume ~800KB of cache, which definitely exceeds L1 capacity and may not fit in L2
    * my knowledge of floating point isn’t extensive enough, but I wonder if generating random numbers between 0 and 1, then accessing them at byte offsets causes some to be interpreted as subnormals/NaNs and the like. Given that your results don’t indicate any difference for offsets not a multiple of 4, I guess not

    Fixing the above may not change the conclusion that aligned reads aren’t important, but I thought the issues should be pointed out regardless.
