2024.09 Vol.1

// Making use of SIMD #Computer-Science

More Power

The adventure continues with the ChaCha20Poly1305 AEAD implementation. It is time to make it more performant.

The goal in any ecosystem leveraging encryption is to get as many of its users as possible using the encryption. If for any reason it is super expensive to use encryption, then a user may decide to just not use it. This chips away at the pool of users who are actually using encryption. And if that pool (the anonymity set) gets small enough, the encryption becomes less powerful. The metadata of who is using the encryption becomes valuable. Why are Bob and Alice the only two people using encryption? What is so valuable that they are paying the high cost to hide it?

Ideally the use of encryption is free and on by default for all users. Now, it is impossible for it to be 100% free. At some point the work to cryptographically scramble the bits needs to happen. But if it is efficient enough that a user can’t even tell that it is happening, then that is usually as good as free. This will keep the anonymity set large and the encryption valuable for the ecosystem.

In our implementation of the ChaCha20 stream cipher, the bit scrambling (heavy lifting) happens in the chacha_block function. So we need to make sure the chacha_block function is as efficient as possible.

One performance tool is an architecture’s Single Instruction Multiple Data (SIMD) assembly instructions. We will see below how the requirements of SIMD instruction usage line up well with cryptographic computations. Ideally the Rust compiler would automatically make use of the SIMD assembly instructions for us. It would just magically “auto-vectorize” the code without having to rely on explicit architecture-dependent “intrinsic” hints from the developer. But as of today…it actually kind of does the magic! However, the code usually needs to be massaged a bit in order for the compiler to automatically recognize a vectorizable function. While this adds complexity, it is way more maintainable than per-architecture settings.

Naive Implementation with Rust v1.80.1

So we start by testing the initial naive ChaCha20 implementation which makes no effort to help the compiler recognize vectorizable functions. It just follows the RFC spec, and while subjective, I think it is pretty easy to read.

Tests are performed on the current stable version of rustc, which is 1.80.1. The more modern toolchain is generally better for diagnosing asm (assembly) usage than the conservative MSRV 1.63.0 of the rust-bitcoin project. And hopefully any findings can be leveraged by older Rust compilers without too much trouble.

The tests involve two compile-time flags which let the compiler squeeze out as many SIMD instructions as it can for the code (an example of setting both is shown after this list).

  • opt-level=3 – Turn the optimization level to the maximum level of 3, a.k.a. cargo’s --release profile. Requiring this flag doesn’t feel like a huge ask for library users since it is the default of the release profile.
  • target-cpu=native – A general “tune the shit out of this code for this specific architecture”. I imagine this is at the cost of the resulting executable being less portable, but I haven’t dug into the weight of the trade-offs here. But it is likely a larger ask for library users than the optimization level flag.
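
For example, one way to set both flags for a quick test from the command line (the RUSTFLAGS environment variable passes extra options through cargo to rustc, and --release implies opt-level=3):

RUSTFLAGS="-C target-cpu=native" cargo build --release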

The naive implementation of chacha_block is at least very well contained in the State struct which represents the state of the ChaCha20 cipher. This makes it easy to analyze with a tool like cargo-show-asm to see exactly what the functions boil down to in the assembly and whether the compiler is taking advantage of SIMD instructions.

/// The 512-bit cipher state is chunk'd up into 16 32-bit words.
///
/// The 16 words can be visualized as a 4x4 matrix:
///
///   0   1   2   3
///   4   5   6   7
///   8   9  10  11
///   12  13  14  15
#[derive(Clone, Copy, Debug)]
struct State([u32; 16]);

impl State {
    /// New prepared state.
    const fn new(key: SessionKey, nonce: Nonce, count: u32) -> Self {
        let mut state: [u32; 16] =
            [WORD_1, WORD_2, WORD_3, WORD_4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];

        // The next two rows (8 words) are based on the session key.
        // Using a while loop here to keep the function const (for loops are not allowed in const fns).
        let mut i = 0;
        while i < 8 {
            state[i + 4] = u32::from_le_bytes([
                key.0[i * 4],
                key.0[i * 4 + 1],
                key.0[i * 4 + 2],
                key.0[i * 4 + 3],
            ]);
            i += 1;
        }

        // The final row is the count and the nonce.
        state[12] = count;
        state[13] = u32::from_le_bytes([nonce.0[0], nonce.0[1], nonce.0[2], nonce.0[3]]);
        state[14] = u32::from_le_bytes([nonce.0[4], nonce.0[5], nonce.0[6], nonce.0[7]]);
        state[15] = u32::from_le_bytes([nonce.0[8], nonce.0[9], nonce.0[10], nonce.0[11]]);

        State(state)
    }

    /// Each "quarter" round of ChaCha scrambles 4 of the 16 words (where the quarter comes from)
    /// that make up the state using some Addition (mod 2^32), Rotation, and XOR (a.k.a. ARX).
    fn quarter_round(&mut self, a: usize, b: usize, c: usize, d: usize) {
        self.0[a] = self.0[a].wrapping_add(self.0[b]);
        self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
        self.0[c] = self.0[c].wrapping_add(self.0[d]);
        self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
        self.0[a] = self.0[a].wrapping_add(self.0[b]);
        self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
        self.0[c] = self.0[c].wrapping_add(self.0[d]);
        self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
    }

    /// Transform the state by performing the ChaCha block function.
    fn chacha_block(&mut self) {
        let initial_state = self.0;
        for _ in 0..10 {
            for (a, b, c, d) in CHACHA_ROUND_INDICIES {
                self.quarter_round(a, b, c, d);
            }
        }
        for (modified, initial) in self.0.iter_mut().zip(initial_state.iter()) {
            *modified = modified.wrapping_add(*initial)
        }
    }

    /// Expose the 512-bit state as a byte stream.
    fn keystream(&self) -> [u8; 64] {
        let mut keystream: [u8; 64] = [0; 64];
        for (k, s) in keystream.chunks_exact_mut(4).zip(self.0) {
            k.copy_from_slice(&s.to_le_bytes());
        }

        keystream
    }
}

The initial naive implementation of the ChaCha20 cipher.
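
The chacha_block function above loops over CHACHA_ROUND_INDICIES, which is not included in the snippet. Based on the RFC’s ordering of four “column” quarter rounds followed by four “diagonal” quarter rounds, the constant is presumably along these lines:

const CHACHA_ROUND_INDICIES: [(usize, usize, usize, usize); 8] = [
    // Column rounds.
    (0, 4, 8, 12),
    (1, 5, 9, 13),
    (2, 6, 10, 14),
    (3, 7, 11, 15),
    // Diagonal rounds.
    (0, 5, 10, 15),
    (1, 6, 11, 12),
    (2, 7, 8, 13),
    (3, 4, 9, 14),
];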

The following long output is the asm of the function generated with the cargo-show-asm tool: cargo asm --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block. Using the --rust flag with cargo-show-asm makes it output little hints as to where you are in the Rust source which is super helpful.

And right away we have some findings!

  • The quarter_round function is always being inlined (no new function on the call-stack) by the compiler no matter the optimization level, so in order to analyze its machine code we need to add a #[inline(never)] just for testing purposes.
  • The chacha_block function has some pretty easy-to-find for loops; the first nested pair maps to .LBB0_1 (outer) and .LBB0_2 (inner).
  • .LBB0_2 clearly contains all the quarter_round craziness; the function has definitely been inlined by the compiler.
  • There are some SIMD instructions being used, like movdqu and paddd, but they are only used in the third (last) for loop of the chacha_block function. The inner quarter round operations are performed using scalar instructions (add, xor, rol).
    • movups, movaps, movdqu, movdqa – All are ways to efficiently move around 128 bits in the special SIMD XMM* registers.
    • paddd – Parallel addition of four 32-bit integers.
    • These are all SSE (Streaming SIMD Extensions) instructions which are a subset of all possible SIMD instructions. More modern AVX ones are also possible. AVX instructions operate on a larger number of bits (256 and 512) so they can be even more efficient.
  • Even with both the opt-level=3 and target-cpu=native flags enabled, the quarter round function does not use SIMD instructions. Sad!
.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
	.p2align	4, 0x90
	.type	chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 103
		fn chacha_block(&mut self) {
	.cfi_startproc
	push rbp
	.cfi_def_cfa_offset 16
	push r15
	.cfi_def_cfa_offset 24
	push r14
	.cfi_def_cfa_offset 32
	push r12
	.cfi_def_cfa_offset 40
	push rbx
	.cfi_def_cfa_offset 48
	sub rsp, 336
	.cfi_def_cfa_offset 384
	.cfi_offset rbx, -48
	.cfi_offset r12, -40
	.cfi_offset r14, -32
	.cfi_offset r15, -24
	.cfi_offset rbp, -16
	mov rbx, rdi
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 104
		let initial_state = self.0;
	movups xmm0, xmmword ptr [rdi]
	movaps xmmword ptr [rsp + 48], xmm0
	movups xmm0, xmmword ptr [rdi + 16]
	movaps xmmword ptr [rsp + 32], xmm0
	movups xmm0, xmmword ptr [rdi + 32]
	movaps xmmword ptr [rsp + 16], xmm0
	movups xmm0, xmmword ptr [rdi + 48]
	movaps xmmword ptr [rsp], xmm0
	xor ebp, ebp
	lea r14, [rip + .L__unnamed_1]
	lea r15, [rsp + 64]
	mov r12, qword ptr [rip + memcpy@GOTPCREL]
	.p2align	4, 0x90
.LBB0_1:
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/int_macros.rs : 2222
		let (a, b) = intrinsics::add_with_overflow(self as $ActualT, rhs as $ActualT);
	inc ebp
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 106
		for (a, b, c, d) in CHACHA_ROUND_INDICIES {
	mov edx, 256
	mov rdi, r15
	mov rsi, r14
	call r12
	mov qword ptr [rsp + 320], 0
	mov qword ptr [rsp + 328], 8
	mov edx, 24
	.p2align	4, 0x90
.LBB0_2:
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ptr/mod.rs : 1325
		crate::intrinsics::read_via_copy(src)
	mov rax, qword ptr [rsp + rdx + 40]
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
		self.0[a] = self.0[a].wrapping_add(self.0[b]);
	cmp rax, 15
	ja .LBB0_9
	mov rdi, qword ptr [rsp + rdx + 48]
	cmp rdi, 16
	jae .LBB0_10
	mov rcx, qword ptr [rsp + rdx + 56]
	mov r8, qword ptr [rsp + rdx + 64]
	mov esi, dword ptr [rbx + 4*rdi]
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	add esi, dword ptr [rbx + 4*rax]
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
		self.0[a] = self.0[a].wrapping_add(self.0[b]);
	mov dword ptr [rbx + 4*rax], esi
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
		self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
	cmp r8, 16
	jae .LBB0_11
	xor esi, dword ptr [rbx + 4*r8]
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	rol esi, 16
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
		self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
	mov dword ptr [rbx + 4*r8], esi
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
		self.0[c] = self.0[c].wrapping_add(self.0[d]);
	cmp rcx, 16
	jae .LBB0_12
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	add esi, dword ptr [rbx + 4*rcx]
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
		self.0[c] = self.0[c].wrapping_add(self.0[d]);
	mov dword ptr [rbx + 4*rcx], esi
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 95
		self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
	xor esi, dword ptr [rbx + 4*rdi]
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	rol esi, 12
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 95
		self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
	mov dword ptr [rbx + 4*rdi], esi
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	add esi, dword ptr [rbx + 4*rax]
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 96
		self.0[a] = self.0[a].wrapping_add(self.0[b]);
	mov dword ptr [rbx + 4*rax], esi
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 97
		self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
	xor esi, dword ptr [rbx + 4*r8]
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	rol esi, 8
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 97
		self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
	mov dword ptr [rbx + 4*r8], esi
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	add esi, dword ptr [rbx + 4*rcx]
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 98
		self.0[c] = self.0[c].wrapping_add(self.0[d]);
	mov dword ptr [rbx + 4*rcx], esi
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 99
		self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
	xor esi, dword ptr [rbx + 4*rdi]
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	rol esi, 7
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 99
		self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
	mov dword ptr [rbx + 4*rdi], esi
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/index_range.rs : 119
		if self.len() > 0 {
	add rdx, 32
	cmp rdx, 280
	jne .LBB0_2
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/cmp.rs : 1565
		fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
	cmp ebp, 10
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
		if self.start < self.end {
	jne .LBB0_1
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
		*modified = modified.wrapping_add(*initial)
	movdqu xmm0, xmmword ptr [rbx]
	movdqa xmm1, xmmword ptr [rsp + 48]
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	paddd xmm1, xmm0
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
		*modified = modified.wrapping_add(*initial)
	movdqu xmm0, xmmword ptr [rbx + 16]
	movdqa xmm2, xmmword ptr [rsp + 32]
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	paddd xmm2, xmm0
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
		*modified = modified.wrapping_add(*initial)
	movdqu xmm0, xmmword ptr [rbx + 32]
	movdqa xmm3, xmmword ptr [rsp + 16]
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	paddd xmm3, xmm0
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
		*modified = modified.wrapping_add(*initial)
	movdqu xmm0, xmmword ptr [rbx + 48]
	movdqa xmm4, xmmword ptr [rsp]
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	paddd xmm4, xmm0
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
		*modified = modified.wrapping_add(*initial)
	movdqu xmmword ptr [rbx], xmm1
	movdqu xmmword ptr [rbx + 16], xmm2
	movdqu xmmword ptr [rbx + 32], xmm3
	movdqu xmmword ptr [rbx + 48], xmm4
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 113
		}
	add rsp, 336
	.cfi_def_cfa_offset 48
	pop rbx
	.cfi_def_cfa_offset 40
	pop r12
	.cfi_def_cfa_offset 32
	pop r14
	.cfi_def_cfa_offset 24
	pop r15
	.cfi_def_cfa_offset 16
	pop rbp
	.cfi_def_cfa_offset 8
	ret
.LBB0_9:
	.cfi_def_cfa_offset 384
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
		self.0[a] = self.0[a].wrapping_add(self.0[b]);
	lea rdx, [rip + .L__unnamed_2]
	mov esi, 16
	mov rdi, rax
	call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_10:
	lea rdx, [rip + .L__unnamed_3]
	mov esi, 16
	call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_11:
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
		self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
	lea rdx, [rip + .L__unnamed_4]
	mov esi, 16
	mov rdi, r8
	call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_12:
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
		self.0[c] = self.0[c].wrapping_add(self.0[d]);
	lea rdx, [rip + .L__unnamed_5]
	mov esi, 16
	mov rdi, rcx
	call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]

cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block

Adding the target-cpu=native compile flag enables a bit more SIMD usage on my laptop. It uses the 256-bit AVX2 instructions instead of the 128-bit SSE2. But it still does not optimize the quarter round function, so not a game changer.

Type Modifications

So something needs to change with the quarter_round function so the compiler knows it can use SIMD instructions in it. The bulk of the work happens in quarter_round, so for it to not be optimized is a big performance blow. But we don’t want to depend on intrinsic hints (e.g. #[target_feature(enable = "avx2")]) since they are architecture specific and would be a pain to maintain.

One option which allows for more optimizations is to generate multiple blocks in parallel. The only difference in the inputs would be a different count for the block counter. This is a form of Horizontal Vectorization. Each state representation, 512-bits, can theoretically be operated on in parallel. And it nicely fits into some of the most modern SIMD AVX-512 instructions, which I doubt is a coincidence.

[a1, a2, a3, a4] -> a1 + a2 + a3 + a4
[b1, b2, b3, b4] -> b1 + b2 + b3 + b4

Horizontal Vectorization, each vector is operated on in parallel.

This does introduce quite a bit more bookkeeping complexity for the cipher implementation, keeping track of the count and making sure generated blocks are used in order. Any missteps and the train goes off the rails between the communicating parties. On the plus side, it is simply a “wrapper” around the current State implementation, but I am not quite sure what other “tips” I would give the compiler in order to leverage this. So I am going to focus on lower hanging fruit for the time being.
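
For reference, here is a rough sketch (not something implemented here) of what the horizontal vectorization wrapper around the existing State type could look like, assuming SessionKey and Nonce are Copy and hand-waving away the ordering bookkeeping:

// Hypothetical horizontal-vectorization wrapper around the naive State.
// The four states differ only in their block counter, so in theory they
// can be processed in parallel.
fn four_keystream_blocks(key: SessionKey, nonce: Nonce, count: u32) -> [[u8; 64]; 4] {
    let mut blocks = [[0u8; 64]; 4];
    for (i, block) in blocks.iter_mut().enumerate() {
        // Same key and nonce, only the counter differs.
        let mut state = State::new(key, nonce, count + i as u32);
        state.chacha_block();
        *block = state.keystream();
    }
    blocks
}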

And that lower hanging fruit is Vertical Vectorization.

[a1, a2, a3, a4] + [b1, b2, b3, b4] = [a1+b1, a2+b2, a3+b3, a4+b4]

Vertical Vectorization, operates on data in multiple vectors in parallel.

When looking at ChaCha’s quarter round function, the simple scalar approach operates on 4 of the 32-bit words at a time, which is a quarter of the total state. A quick recap here: the cipher’s state is broken up into 16 words. When the state is updated (a.k.a. a block created), the quarter round function is first run on the words with indexes [0,4,8,12]. Looking at the ascii art below, you can visualize this as a “column” of the state.

 0   1   2   3
 4   5   6   7
 8   9  10  11
12  13  14  15

4x4 matrix indexes.

Next, a quarter round is run on [1,5,9,13], then [2,6,10,14], and finally [3,7,11,15]. All the columns! It then runs some quarter rounds across some “diagonals”, but before going there, notice that the column quarter rounds each act on an independent chunk of data. So, while the scalar quarter round implementation acts on a “column”, the vectorized approach will take all the “rows” and apply the quarter round to each “lane” (the SIMD word for column).

[  0  | 1 |  2 |  3 ] // U32x4
[  4  | 5 |  6 |  7 ] // U32x4
[  8  | 9 | 10 | 11 ] // U32x4
[ 12  |13 | 14 | 15 ] // U32x4

Rows and Lanes.

Each row is a vector of 4 32-bit words. Not by chance, this vector size fits perfectly into a lot of SIMD instructions. The quarter round operations are still applied on the columns (e.g. [0,4,8,12]), but this can be done in parallel with the other columns by leveraging the SIMD lanes. We just have to organize our data in a way that lets the compiler pick up on the fact that it can do this. And as we saw above in the test on the naive code, tossing a bunch of tuples at it ain’t good enough.

The Rust std library actually has some helpful SIMD types in the appropriately named std::simd module: u32x4 and u32x16. These are also available in core which is good news for no-std support. However, the interface is still experimental and feature gated on 86656, and the portable_simd feature is not available in Rust 1.63.0 (rust-bitcoin’s MSRV). Bummer! But we can experiment with types similar to the std library (which by the way is what LDK did in their ChaCha implementation).

And boom, here is the new fancy U32x4 type which matches the std library’s simd interface for easy migration in the future. This one though implements just the bare minimum needed for ChaCha.

#[repr(C, align(16))]
#[derive(Clone, Copy, Debug, PartialEq)]
struct U32x4([u32; 4]);

impl U32x4 {
    #[inline(always)]
    fn wrapping_add(self, rhs: Self) -> Self {
        let mut result = [0u32; 4];
        (0..4).for_each(|i| {
            result[i] = self.0[i].wrapping_add(rhs.0[i]);
        });
        U32x4(result)
    }

    #[inline(always)]
    fn rotate_left(self, n: u32) -> Self {
        let mut result = [0u32; 4];
        (0..4).for_each(|i| {
            result[i] = self.0[i].rotate_left(n);
        });
        U32x4(result)
    }

    #[inline(always)]
    fn rotate_elements_left<const N: u32>(self) -> Self {
        let mut result = [0u32; 4];
        (0..4).for_each(|i| {
            result[i] = self.0[(i + N as usize) % 4];
        });
        U32x4(result)
    }

    #[inline(always)]
    fn rotate_elements_right<const N: u32>(self) -> Self {
        let mut result = [0u32; 4];
        (0..4).for_each(|i| {
            result[i] = self.0[(i + 4 - N as usize) % 4];
        });
        U32x4(result)
    }

    #[inline(always)]
    fn to_le_bytes(self) -> [u8; 16] {
        let mut bytes = [0u8; 16];
        bytes[0..4].copy_from_slice(&self.0[0].to_le_bytes());
        bytes[4..8].copy_from_slice(&self.0[1].to_le_bytes());
        bytes[8..12].copy_from_slice(&self.0[2].to_le_bytes());
        bytes[12..16].copy_from_slice(&self.0[3].to_le_bytes());
        bytes
    }
}

impl BitXor for U32x4 {
    type Output = Self;

    #[inline(always)]
    fn bitxor(self, rhs: Self) -> Self {
        let mut result = [0u32; 4];
        (0..4).for_each(|i| {
            result[i] = self.0[i] ^ rhs.0[i];
        });
        U32x4(result)
    }
}

Custom U32x4 type.
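
A quick sanity check on the element rotations (a hypothetical test, not part of the original code), which also shows the intended wrap-around behavior:

#[test]
fn u32x4_element_rotations() {
    let v = U32x4([1, 2, 3, 4]);
    // Rotating elements left shifts each word toward index 0, wrapping around.
    assert_eq!(v.rotate_elements_left::<1>(), U32x4([2, 3, 4, 1]));
    // Rotating right is the inverse operation.
    assert_eq!(v.rotate_elements_right::<1>(), U32x4([4, 1, 2, 3]));
    assert_eq!(v.rotate_elements_left::<2>().rotate_elements_right::<2>(), v);
}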

This type has a few SIMD relevant design choices:

  • Heavy use of inlined function calls to allow the compiler to recognize the code as vectorizable.
  • Heavy use of for-each loops which are also easy for the compiler to recognize as vectorizable.
  • The type is based on an array instead of a tuple since the heterogeneous nature of tuples can confuse the compiler into thinking it is not vectorizable.
  • Explicit memory layout (#[repr(C, align(16))]) lines up with the 128-bit SIMD register size.

SIMD Results

The new U32x4 type is the heart of the change, but some tweaks are necessary in order for the compiler to pick up on how data flows from the top of the chacha_block function. The quarter_round now operates on 4 U32x4’s instead of 4 scalars.
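
The reworked State struct itself isn’t shown in this excerpt, but based on how chacha_block reads self.a through self.d below, it presumably holds one U32x4 per row of the 4x4 matrix, something like this sketch (the word-to-row mapping is my assumption, mirroring the naive State::new above):

// Assumed shape of the reworked state: one U32x4 per row.
#[derive(Clone, Copy, Debug)]
struct State {
    a: U32x4, // Words 0-3: the ChaCha constants.
    b: U32x4, // Words 4-7: first half of the session key.
    c: U32x4, // Words 8-11: second half of the session key.
    d: U32x4, // Words 12-15: block count and nonce.
}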

// Four quarter rounds performed on the entire state of the
// cipher in a vectorized SIMD friendly fashion.
#[inline(always)]
fn quarter_round(a: U32x4, b: U32x4, c: U32x4, d: U32x4) -> (U32x4, U32x4, U32x4, U32x4) {
    let a = a.wrapping_add(b);
    let d = d.bitxor(a).rotate_left(16);

    let c = c.wrapping_add(d);
    let b = b.bitxor(c).rotate_left(12);

    let a = a.wrapping_add(b);
    let d = d.bitxor(a).rotate_left(8);

    let c = c.wrapping_add(d);
    let b = b.bitxor(c).rotate_left(7);

    (a, b, c, d)
}

// Perform a round on "columns" and then "diagonals" of the state.
//
// The column quarter rounds are made up of indexes: `[0,4,8,12]`, `[1,5,9,13]`, `[2,6,10,14]`, `[3,7,11,15]`.
// The diagonals quarter rounds are made up of indexes: `[0,5,10,15]`, `[1,6,11,12]`, `[2,7,8,13]`, `[3,4,9,14]`.
//
// The underlying quarter_round function is vectorized using the
// u32x4 type in order to perform 4 quarter round functions at the same time.
// This is a little more difficult to read, but it gives the compiler
// a strong hint to use the performant SIMD instructions.
#[inline(always)]
fn double_round(a: U32x4, b: U32x4, c: U32x4, d: U32x4) -> (U32x4, U32x4, U32x4, U32x4) {
    // Vectorized column quarter round.
    let (a, b, c, d) = Self::quarter_round(a, b, c, d);

    // Rotate the words into their diagonal positions
    // in order to efficiently use the same quarter round function.
    let b = b.rotate_elements_left::<1>();
    let c = c.rotate_elements_left::<2>();
    let d = d.rotate_elements_left::<3>();

    // Vectorized diagonal quarter round.
    let (a, b, c, d) = Self::quarter_round(a, b, c, d);

    // Rotate the words back into their normal positions.
    (
        a,
        b.rotate_elements_right::<1>(),
        c.rotate_elements_right::<2>(),
        d.rotate_elements_right::<3>(),
    )
}

/// Transform the state by performing the ChaCha block function.
fn chacha_block(&mut self) {
    // State is pulled out into temporary variables
    // to make the function more vectorizable.
    let mut a = self.a;
    let mut b = self.b;
    let mut c = self.c;
    let mut d = self.d;

    for _ in 0..10 {
        (a, b, c, d) = Self::double_round(a, b, c, d);
    }

    // Add the initial state to the final state
    self.a = a.wrapping_add(self.a);
    self.b = b.wrapping_add(self.b);
    self.c = c.wrapping_add(self.c);
    self.d = d.wrapping_add(self.d);
}

The block and round functions tweaked to take the new type as inputs, making the vectorization possible.
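
The keystream method isn’t shown here either, but the to_le_bytes helper on U32x4 suggests it now serializes the state row by row, roughly like this sketch:

/// Expose the 512-bit state as a byte stream (sketch).
fn keystream(&self) -> [u8; 64] {
    let mut keystream = [0u8; 64];
    // Each row serializes to 16 little-endian bytes, matching the naive version's output.
    keystream[0..16].copy_from_slice(&self.a.to_le_bytes());
    keystream[16..32].copy_from_slice(&self.b.to_le_bytes());
    keystream[32..48].copy_from_slice(&self.c.to_le_bytes());
    keystream[48..64].copy_from_slice(&self.d.to_le_bytes());
    keystream
}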

Ok, with that code in place, here are the asm results.

Without the opt-level=3 or target-cpu=native flags, the code is entirely scalar. The output is not identical to the naive version above, so maybe the compiler is able to make some other optimizations, but they are not SIMD related. With opt-level=3, there is still no SIMD usage, which I find disappointing, but I am not sure if it is a deal-breaker. With the target-cpu=native flag, the code is entirely SIMD optimized, including the quarter round. There is rampant use of the xmm* registers and SIMD instructions throughout.

.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
	.p2align	4, 0x90
	.type	chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
	.cfi_startproc
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 224
		let mut a = self.a;
	mov esi, dword ptr [rdi]
	vmovdqu xmm2, xmmword ptr [rdi + 4]
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 225
		let mut b = self.b;
	mov r10d, dword ptr [rdi + 32]
	vmovdqu xmm1, xmmword ptr [rdi + 20]
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 226
		let mut c = self.c;
	vmovdqu xmm0, xmmword ptr [rdi + 36]
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 227
		let mut d = self.d;
	mov edx, dword ptr [rdi + 52]
	mov ecx, dword ptr [rdi + 56]
	mov eax, dword ptr [rdi + 60]
	vpshufd xmm3, xmm1, 198
	vpshufd xmm4, xmm2, 161
	vpinsrd xmm8, xmm4, esi, 2
	vpshufd xmm4, xmm0, 24
	vpinsrd xmm10, xmm4, edx, 0
	vpshufd xmm4, xmm0, 255
	vpinsrd xmm11, xmm4, eax, 1
	vpblendd xmm7, xmm3, xmm2, 8
	mov r8d, 10
	vmovdqa xmm3, xmmword ptr [rip + .LCPI0_0]
	vmovdqa xmm4, xmmword ptr [rip + .LCPI0_1]
	vmovdqa xmm5, xmmword ptr [rip + .LCPI0_2]
	vmovdqa xmm6, xmmword ptr [rip + .LCPI0_3]
	mov r9d, ecx
	.p2align	4, 0x90
.LBB0_1:
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpshufd xmm9, xmm7, 57
	vpaddd xmm8, xmm9, xmm8
	vpinsrd xmm9, xmm10, r10d, 0
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
		result[i] = self.0[i] ^ rhs.0[i];
	vpshufd xmm12, xmm8, 78
	vpbroadcastd xmm10, xmm10
	vpblendd xmm10, xmm12, xmm10, 8
	vpunpcklqdq xmm11, xmm11, xmm8
	vpinsrd xmm11, xmm11, r9d, 2
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	vpshufb xmm11, xmm11, xmm3
	vpshufb xmm10, xmm10, xmm3
	vpxor xmm10, xmm10, xmm11
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpaddd xmm9, xmm10, xmm9
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
		result[i] = self.0[i] ^ rhs.0[i];
	vpshufd xmm11, xmm9, 57
	vpxor xmm7, xmm11, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	vpsrld xmm11, xmm7, 20
	vpslld xmm7, xmm7, 12
	vpor xmm7, xmm11, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpshufd xmm11, xmm7, 57
	vpaddd xmm8, xmm11, xmm8
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	vpshufb xmm11, xmm8, xmm4
	vpshufb xmm10, xmm10, xmm5
	vpxor xmm10, xmm11, xmm10
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpaddd xmm9, xmm10, xmm9
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
		result[i] = self.0[i] ^ rhs.0[i];
	vpshufd xmm11, xmm9, 57
	vpxor xmm7, xmm11, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	vpsrld xmm11, xmm7, 25
	vpslld xmm7, xmm7, 7
	vpor xmm7, xmm11, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpaddd xmm8, xmm8, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	vpshufb xmm11, xmm8, xmm3
	vpshufb xmm10, xmm10, xmm6
	vpxor xmm10, xmm11, xmm10
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpaddd xmm11, xmm10, xmm9
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
		result[i] = self.0[i] ^ rhs.0[i];
	vpxor xmm7, xmm11, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	vpsrld xmm9, xmm7, 20
	vpslld xmm7, xmm7, 12
	vpor xmm7, xmm9, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpaddd xmm8, xmm8, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	vpshufb xmm9, xmm8, xmm5
	vpshufb xmm10, xmm10, xmm5
	vpxor xmm9, xmm9, xmm10
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpaddd xmm12, xmm9, xmm11
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
		result[i] = self.0[i] ^ rhs.0[i];
	vpxor xmm7, xmm12, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
		return intrinsics::rotate_left(self, n);
	vpsrld xmm10, xmm7, 25
	vpslld xmm7, xmm7, 7
	vpor xmm7, xmm10, xmm7
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
		if self.start < self.end {
	vmovd r10d, xmm12
	vpextrd r9d, xmm9, 3
	vpblendd xmm10, xmm12, xmm9, 1
	vpshufd xmm11, xmm9, 233
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/cmp.rs : 1565
		fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
	dec r8d
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
		if self.start < self.end {
	jne .LBB0_1
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpextrd r8d, xmm8, 2
	add r8d, esi
	vpshufd xmm3, xmm8, 241
	vpblendd xmm3, xmm3, xmm7, 8
	vpaddd xmm2, xmm3, xmm2
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
		self.a = a.wrapping_add(self.a);
	mov dword ptr [rdi], r8d
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpbroadcastd xmm3, xmm12
	vpshufd xmm4, xmm7, 198
	vpblendd xmm3, xmm4, xmm3, 8
	vpaddd xmm1, xmm3, xmm1
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
		self.a = a.wrapping_add(self.a);
	vmovdqu xmmword ptr [rdi + 4], xmm2
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vpbroadcastq xmm2, xmm9
	vpshufd xmm3, xmm12, 219
	vpblendd xmm2, xmm3, xmm2, 8
	vpaddd xmm0, xmm2, xmm0
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 235
		self.b = b.wrapping_add(self.b);
	vmovdqu xmmword ptr [rdi + 20], xmm1
		// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
		intrinsics::wrapping_add(self, rhs)
	vmovd esi, xmm9
	add esi, edx
	add r9d, ecx
	vpextrd ecx, xmm9, 2
	add ecx, eax
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 236
		self.c = c.wrapping_add(self.c);
	vmovdqu xmmword ptr [rdi + 36], xmm0
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 237
		self.d = d.wrapping_add(self.d);
	mov dword ptr [rdi + 52], esi
	mov dword ptr [rdi + 56], r9d
	mov dword ptr [rdi + 60], ecx
		// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 238
		}
	ret

Fully optimized asm output from the modern compiler.

And what about the old MSRV compiler with the new code? With fully optimized flags, the 1.63.0 version of rustc partially uses SIMD instructions. It is better than the initial naive version, but the quarter round is not using SIMD and the asm generally uses more instructions than the most optimized version of the code above. This isn’t ideal, but might be all the juice we can squeeze out of the older compilers.

That leaves just one question: is it unusual for a library to require the target-cpu flag to get SIMD instructions? Using LDK’s ChaCha20 implementation as a benchmark, it does not use SIMD instructions without at least the opt-level=3 flag. And the quarter round is also not vectorized without target-cpu. So perhaps the target-cpu flag is a necessity for now until the usage of Rust’s native SIMD types is possible and the compiler gets even smarter.