A gentle introduction to SIMD.
2025-09-18
A few weeks ago I was reading up on the one billion row challenge and was wondering what the fastest way
would be to read a file from disk. For comparison I decided it would be
interesting to find out how wc
does it on my Fedora 42 workstation. Turns out
wc
just uses read
but with a buffersize of 256KiB, a value that seems to be optimized for an
average
system.
My attention immediately diverted to the actual implementation that uses AVX2 when available, I've never really taken a closer look at SIMD instructions before and decided it was the right moment as the code didn't look daunting.
1 extern struct wc_lines
2 wc_lines_avx2 (int fd)
3 {
4 intmax_t lines = 0;
5 intmax_t bytes = 0;
6
7 __m256i endlines = _mm256_set1_epi8 ('\n');
8
9 while (true)
10 {
11 __m256i avx_buf[IO_BUFSIZE / sizeof (__m256i)];
12 ssize_t bytes_read = read (fd, avx_buf, sizeof avx_buf);
13 if (bytes_read <= 0)
14 return (struct wc_lines) { bytes_read == 0 ? 0 : errno, lines, bytes };
15
16 bytes += bytes_read;
17 __m256i *datap = avx_buf;
18
19 while (bytes_read >= 32)
20 {
21 __m256i to_match = _mm256_load_si256 (datap);
22 __m256i matches = _mm256_cmpeq_epi8 (to_match, endlines);
23 int mask = _mm256_movemask_epi8 (matches);
24 lines += __builtin_popcount (mask);
25 datap += 1;
26 bytes_read -= 32;
27 }
28
29 /* Finish up any left over bytes */
30 char *end = (char *) datap + bytes_read;
31 for (char *p = (char *) datap; p < end; p++)
32 lines += *p == '\n';
33 }
34 }
The algorithm is pretty simple. It first prepares a 256 bit buffer with 32 \n
values at L7,
then reads a chunk of data, and as long as there is data to read, it loads 256
bits of the data at L21 and compares the data with the buffer of newlines using
_mm256_cmpeq_epi8
. This sets the corresponding bits of the matches buffer to all
1's in case a data byte is a newline or all 0's in case it's not.
To count the number of lines, _mm256_movemask_epi8
creates a mask from the most significant bit of each
byte of the source vector and stores the result in the returned value, in short it transforms 11111111
(a newline match)
into 10000000
, and leaves 00000000
alone. Now what remains is counting the
number of set bits in the mask with __builtin_popcount
.
That's it! The remainder is some accounting when there's leftover data that is too small to fit an AVX2 instruction. Now, that was a gentle introduction, wasn't it?