// Making use of SIMD #Computer-Science
The adventure continues with the ChaCha20Poly1305 AEAD implementation. It is time to make it more performant.
The goal in any ecosystem leveraging encryption is to get as many of its users as possible using the encryption. If for any reason it is super expensive to use encryption, then a user may decide to just not use it. This chips away at the pool of users who are actually using encryption. And if that pool (the anonymity set) gets small enough, the encryption becomes less powerful. The metadata of who is using the encryption becomes valuable. Why are Bob and Alice the only two people using encryption? What is so valuable that they are paying the high cost to hide it?
Ideally the use of encryption is free and on by default for all users. Now, it is impossible for it to be 100% free. At some point the work to cryptographically scramble the bits needs to happen. But if it is efficient enough that a user can’t even tell that it is happening, then that is usually as good as free. This will keep the anonymity set large and the encryption valuable for the ecosystem.
In our implementation of the ChaCha20 stream cipher, the bit scrambling (heavy lifting) happens in the chacha_block function. So we need to make sure the chacha_block function is as efficient as possible.
One performance tool is an architecture’s Single Instruction Multiple Data (SIMD) assembly instructions. We will see below how the requirements of SIMD instruction usage line up well with cryptographic computations. Ideally the Rust compiler would automatically make use of the SIMD assembly instructions for us. It would just magically “auto-vectorize” the code without having to rely on explicit architecture-dependent “intrinsic” hints from the developer. But as of today…it actually kind of does the magic! However, the code usually needs to be massaged a bit in order for the compiler to automatically recognize a vectorizable function. While this adds complexity, it is way more maintainable than per-architecture settings.
Naive Implementation with Rust v1.80.1
So we start by testing the initial naive ChaCha20 implementation which makes no effort to help the compiler recognize vectorizable functions. It just follows the RFC spec, and while subjective, I think it is pretty easy to read.
Tests are performed on the current stable version of rustc, which is 1.80.1. The more modern tooling is generally better at diagnosing asm (assembly) usage than the conservative MSRV 1.63.0 of the rust-bitcoin project. And hopefully any findings can be leveraged by older Rust compilers without too much trouble.
The tests involve two compile time flags which enable the compiler to squeeze out as many SIMD instructions as it can for the code (a sketch of how to pass them follows the list).
- opt-level=3 – Turn the optimization level up to the maximum level of 3, a.k.a. cargo’s --release profile. Requiring this flag doesn’t feel like a huge ask for library users since it is the cargo default.
- target-cpu=native – A general “tune the shit out of this code for this specific architecture”. I imagine this is at the cost of the resulting executable being less portable, but I haven’t dug into the weight of the trade-offs here. But it is likely a larger ask for library users than the optimization level flag.
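For reference, here is one way these flags can be passed when experimenting. This is just an illustrative invocation, not something the crate requires: opt-level=3 rides along with the --release profile, while target-cpu=native goes through RUSTFLAGS.
# Build with both flags applied for local benchmarking/asm inspection.
RUSTFLAGS="-C target-cpu=native" cargo build --release
# Or persist it for a workspace in .cargo/config.toml:
# [build]
# rustflags = ["-C", "target-cpu=native"]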
The naive implementation of chacha_block is at least very well contained in the State struct which represents the state of the ChaCha20 cipher. This makes it easy to analyze with tools like cargo-show-asm to see exactly what the functions are boiling down to in the assembly and whether the compiler is taking advantage of SIMD instructions.
/// The 512-bit cipher state is chunk'd up into 16 32-bit words.
///
/// The 16 words can be visualized as a 4x4 matrix:
///
/// 0 1 2 3
/// 4 5 6 7
/// 8 9 10 11
/// 12 13 14 15
#[derive(Clone, Copy, Debug)]
struct State([u32; 16]);
impl State {
/// New prepared state.
const fn new(key: SessionKey, nonce: Nonce, count: u32) -> Self {
let mut state: [u32; 16] =
[WORD_1, WORD_2, WORD_3, WORD_4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
// The next two rows (8 words) are based on the session key.
// Using a while loop here to keep the function const (for loops are not allowed in const fns).
let mut i = 0;
while i < 8 {
state[i + 4] = u32::from_le_bytes([
key.0[i * 4],
key.0[i * 4 + 1],
key.0[i * 4 + 2],
key.0[i * 4 + 3],
]);
i += 1;
}
// The final row is the count and the nonce.
state[12] = count;
state[13] = u32::from_le_bytes([nonce.0[0], nonce.0[1], nonce.0[2], nonce.0[3]]);
state[14] = u32::from_le_bytes([nonce.0[4], nonce.0[5], nonce.0[6], nonce.0[7]]);
state[15] = u32::from_le_bytes([nonce.0[8], nonce.0[9], nonce.0[10], nonce.0[11]]);
State(state)
}
/// Each "quarter" round of ChaCha scrambles 4 of the 16 words (where the quarter comes from)
/// that make up the state using some Addition (mod 2^32), Rotation, and XOR (a.k.a. ARX).
fn quarter_round(&mut self, a: usize, b: usize, c: usize, d: usize) {
self.0[a] = self.0[a].wrapping_add(self.0[b]);
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
self.0[c] = self.0[c].wrapping_add(self.0[d]);
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
self.0[a] = self.0[a].wrapping_add(self.0[b]);
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
self.0[c] = self.0[c].wrapping_add(self.0[d]);
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
}
/// Transform the state by performing the ChaCha block function.
fn chacha_block(&mut self) {
let initial_state = self.0;
for _ in 0..10 {
for (a, b, c, d) in CHACHA_ROUND_INDICIES {
self.quarter_round(a, b, c, d);
}
}
for (modified, initial) in self.0.iter_mut().zip(initial_state.iter()) {
*modified = modified.wrapping_add(*initial)
}
}
/// Expose the 512-bit state as a byte stream.
fn keystream(&self) -> [u8; 64] {
let mut keystream: [u8; 64] = [0; 64];
for (k, s) in keystream.chunks_exact_mut(4).zip(self.0) {
k.copy_from_slice(&s.to_le_bytes());
}
keystream
}
}
The initial naive implementation of the ChaCha20 cipher.
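As a quick aside on how this block function gets used downstream, here is a hedged sketch of my own (not code from the crate): each 64-byte chunk of plaintext gets its own State with an incrementing count, and the chunk is XORed with the resulting keystream. It assumes SessionKey and Nonce are Copy and constructed elsewhere.
// Encrypt (or decrypt, it is symmetric) one 64-byte chunk with the naive State above.
fn encrypt_chunk(key: SessionKey, nonce: Nonce, count: u32, chunk: &mut [u8]) {
    let mut state = State::new(key, nonce, count);
    state.chacha_block();
    let keystream = state.keystream();
    for (byte, k) in chunk.iter_mut().zip(keystream.iter()) {
        *byte ^= k;
    }
}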
The following long output is the asm of the function generated with the cargo-show-asm tool: cargo asm --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block. Using the --rust flag with cargo-show-asm makes it output little hints as to where you are in the Rust source, which is super helpful.
And right away we have some findings!
- The quarter_round function is always being inlined (no new function on the call-stack) by the compiler no matter the optimization level, so in order to analyze its machine code we need to add a #[inline(never)] just for testing purposes (see the sketch just after this list).
- The chacha_block function has some pretty easy to find for loops, the first nested pair map to .LBB0_1 (outer) and .LBB0_2 (inner). .LBB0_2 clearly contains all the quarter_round craziness, so the function has definitely been inlined by the compiler.
- There are some SIMD instructions being used, like movdqu and paddd, but they are only used in the third (last) for loop of the chacha_block function. The inner quarter round operations are performed using scalar instructions (add, xor, rol).
  - movups, movaps, movdqu, movdqa – All are ways to efficiently move around 128 bits in the special SIMD XMM* registers.
  - paddd – Parallel addition of four 32-bit integers.
  - These are all SSE (Streaming SIMD Extensions) instructions which are a subset of all possible SIMD instructions. More modern AVX ones are also possible. AVX instructions operate on a larger number of bits (256 and 512) so can be even more efficient.
- Even with both the opt-level=3 and target-cpu=native flags enabled, the quarter round function does not use SIMD instructions. Sad!
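For completeness, the analysis-only tweak from the first finding looks like this. It just pins quarter_round as its own symbol so cargo asm can dump its machine code in isolation; it is removed once the inspection is done.
// Analysis-only: prevent inlining so the function shows up on the call-stack.
#[inline(never)]
fn quarter_round(&mut self, a: usize, b: usize, c: usize, d: usize) {
    self.0[a] = self.0[a].wrapping_add(self.0[b]);
    self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
    self.0[c] = self.0[c].wrapping_add(self.0[d]);
    self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
    self.0[a] = self.0[a].wrapping_add(self.0[b]);
    self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
    self.0[c] = self.0[c].wrapping_add(self.0[d]);
    self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
}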
.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
.p2align 4, 0x90
.type chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 103
fn chacha_block(&mut self) {
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
push r15
.cfi_def_cfa_offset 24
push r14
.cfi_def_cfa_offset 32
push r12
.cfi_def_cfa_offset 40
push rbx
.cfi_def_cfa_offset 48
sub rsp, 336
.cfi_def_cfa_offset 384
.cfi_offset rbx, -48
.cfi_offset r12, -40
.cfi_offset r14, -32
.cfi_offset r15, -24
.cfi_offset rbp, -16
mov rbx, rdi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 104
let initial_state = self.0;
movups xmm0, xmmword ptr [rdi]
movaps xmmword ptr [rsp + 48], xmm0
movups xmm0, xmmword ptr [rdi + 16]
movaps xmmword ptr [rsp + 32], xmm0
movups xmm0, xmmword ptr [rdi + 32]
movaps xmmword ptr [rsp + 16], xmm0
movups xmm0, xmmword ptr [rdi + 48]
movaps xmmword ptr [rsp], xmm0
xor ebp, ebp
lea r14, [rip + .L__unnamed_1]
lea r15, [rsp + 64]
mov r12, qword ptr [rip + memcpy@GOTPCREL]
.p2align 4, 0x90
.LBB0_1:
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/int_macros.rs : 2222
let (a, b) = intrinsics::add_with_overflow(self as $ActualT, rhs as $ActualT);
inc ebp
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 106
for (a, b, c, d) in CHACHA_ROUND_INDICIES {
mov edx, 256
mov rdi, r15
mov rsi, r14
call r12
mov qword ptr [rsp + 320], 0
mov qword ptr [rsp + 328], 8
mov edx, 24
.p2align 4, 0x90
.LBB0_2:
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ptr/mod.rs : 1325
crate::intrinsics::read_via_copy(src)
mov rax, qword ptr [rsp + rdx + 40]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
self.0[a] = self.0[a].wrapping_add(self.0[b]);
cmp rax, 15
ja .LBB0_9
mov rdi, qword ptr [rsp + rdx + 48]
cmp rdi, 16
jae .LBB0_10
mov rcx, qword ptr [rsp + rdx + 56]
mov r8, qword ptr [rsp + rdx + 64]
mov esi, dword ptr [rbx + 4*rdi]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rax]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
self.0[a] = self.0[a].wrapping_add(self.0[b]);
mov dword ptr [rbx + 4*rax], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
cmp r8, 16
jae .LBB0_11
xor esi, dword ptr [rbx + 4*r8]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 16
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
mov dword ptr [rbx + 4*r8], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
self.0[c] = self.0[c].wrapping_add(self.0[d]);
cmp rcx, 16
jae .LBB0_12
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rcx]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
self.0[c] = self.0[c].wrapping_add(self.0[d]);
mov dword ptr [rbx + 4*rcx], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 95
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
xor esi, dword ptr [rbx + 4*rdi]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 12
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 95
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
mov dword ptr [rbx + 4*rdi], esi
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rax]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 96
self.0[a] = self.0[a].wrapping_add(self.0[b]);
mov dword ptr [rbx + 4*rax], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 97
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
xor esi, dword ptr [rbx + 4*r8]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 8
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 97
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
mov dword ptr [rbx + 4*r8], esi
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rcx]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 98
self.0[c] = self.0[c].wrapping_add(self.0[d]);
mov dword ptr [rbx + 4*rcx], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 99
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
xor esi, dword ptr [rbx + 4*rdi]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 7
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 99
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
mov dword ptr [rbx + 4*rdi], esi
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/index_range.rs : 119
if self.len() > 0 {
add rdx, 32
cmp rdx, 280
jne .LBB0_2
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/cmp.rs : 1565
fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
cmp ebp, 10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
if self.start < self.end {
jne .LBB0_1
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx]
movdqa xmm1, xmmword ptr [rsp + 48]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm1, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx + 16]
movdqa xmm2, xmmword ptr [rsp + 32]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm2, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx + 32]
movdqa xmm3, xmmword ptr [rsp + 16]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm3, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx + 48]
movdqa xmm4, xmmword ptr [rsp]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm4, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmmword ptr [rbx], xmm1
movdqu xmmword ptr [rbx + 16], xmm2
movdqu xmmword ptr [rbx + 32], xmm3
movdqu xmmword ptr [rbx + 48], xmm4
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 113
}
add rsp, 336
.cfi_def_cfa_offset 48
pop rbx
.cfi_def_cfa_offset 40
pop r12
.cfi_def_cfa_offset 32
pop r14
.cfi_def_cfa_offset 24
pop r15
.cfi_def_cfa_offset 16
pop rbp
.cfi_def_cfa_offset 8
ret
.LBB0_9:
.cfi_def_cfa_offset 384
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
self.0[a] = self.0[a].wrapping_add(self.0[b]);
lea rdx, [rip + .L__unnamed_2]
mov esi, 16
mov rdi, rax
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_10:
lea rdx, [rip + .L__unnamed_3]
mov esi, 16
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_11:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
lea rdx, [rip + .L__unnamed_4]
mov esi, 16
mov rdi, r8
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_12:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
self.0[c] = self.0[c].wrapping_add(self.0[d]);
lea rdx, [rip + .L__unnamed_5]
mov esi, 16
mov rdi, rcx
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block
Adding the target-cpu=native compile flag enables a bit more SIMD usage on my laptop. It uses the 256-bit AVX2 instructions instead of the 128-bit SSE2. But it still does not optimize the quarter round function, so not a game changer.
Type Modifications
So something needs to change with the quarter_round function so the compiler knows it can use SIMD instructions in it. The bulk of the work is happening in the quarter_round, so for it to not be optimized is a big performance blow. But we don’t want to depend on intrinsic hints (e.g. #[target_feature(enable = "avx2")]) since they are architecture specific and would be a pain to maintain.
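To make that concrete, here is roughly the kind of code the intrinsic route leads to. This is a sketch of my own for illustration (not anything from the library): unsafe, gated to one architecture, and it would need to be duplicated for every platform we care about.
// Lane-wise addition of eight 32-bit words using AVX2 intrinsics, x86_64 only.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_lanes_avx2(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    use core::arch::x86_64::{__m256i, _mm256_add_epi32, _mm256_loadu_si256, _mm256_storeu_si256};
    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    let mut out = [0u32; 8];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi32(va, vb));
    out
}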
One option which allows for more optimizations is to generate multiple blocks in parallel. The only difference in the inputs would be a different count for the block counter. This is a form of Horizontal Vectorization. Each state representation, 512 bits, can theoretically be operated on in parallel. And it nicely fits into some of the most modern SIMD AVX-512 instructions, which I doubt is a coincidence.
[a1, a2, a3, a4] -> a1 + a2 + a3 + a4
[b1, b2, b3, b4] -> b1 + b2 + b3 + b4
Horizontal Vectorization, each vector is operated on in parallel.
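Concretely, the wrapper would just stamp out states that differ only by their count, something like this sketch (my own illustration, assuming SessionKey and Nonce are Copy; none of this exists in the implementation yet).
// Four block states that differ only in their counter; each one could then be
// run through chacha_block independently, and in theory in parallel.
fn four_states(key: SessionKey, nonce: Nonce, count: u32) -> [State; 4] {
    [
        State::new(key, nonce, count),
        State::new(key, nonce, count + 1),
        State::new(key, nonce, count + 2),
        State::new(key, nonce, count + 3),
    ]
}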
This does introduce quite a bit more bookkeeping complexity for the cipher implementation, to keep track of the count and make sure generated blocks are used in order. Any missteps and the train goes off the rails between the parties communicating. On the plus side, it is simply a “wrapper” around the current State implementation, but I am not quite sure what other “tips” I would give the compiler in order to leverage this. So I am going to focus on lower hanging fruit for the time being.
And that fruit is Vertical Vectorization.
[a1, a2, a3, a4] + [b1, b2, b3, b4] = [a1+b1, a2+b2, a3+b3, a4+b4]
Vertical Vectorization, operates on data in multiple vectors in parallel.
When looking at ChaCha’s quarter round function, the simple scalar approach operates on 4 32-bit words at a time, which is a quarter of the total state. A quick recap here: the cipher’s state is broken up into 16 words. When the state is updated (a.k.a. a block created), the quarter round function is first run on the words with indexes [0,4,8,12]. Looking at the ascii art below, you can visualize this as a “column” of the state.
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
4x4 matrix indexes.
Next, a quarter round is run on [1,5,9,13], then [2,6,10,14], and finally [3,7,11,15]. All the columns! It then runs some quarter rounds across some “diagonals”, but before going there, notice that the column quarter rounds each act on an independent chunk of data. So, while the scalar quarter round implementation acts on a “column”, the vectorized approach will take all the “rows” and apply the quarter round to each “lane” (SIMD word for column).
[ 0 | 1 | 2 | 3 ] // U32x4
[ 4 | 5 | 6 | 7 ] // U32x4
[ 8 | 9 | 10 | 11 ] // U32x4
[ 12 | 13 | 14 | 15 ] // U32x4
Rows and Lanes.
Each row is a vector of 4 32-bit words. Not by chance, this vector size fits perfectly into a lot of SIMD instructions. The quarter round operations are still applied on the columns (e.g. [0,4,8,12]), but this can be done in parallel with the other columns by leveraging the SIMD lanes. We just have to organize our data in a way for the compiler to pick up on the fact that it can do this. And as we saw above in the test on the naive code, tossing a bunch of tuples at it ain’t good enough.
The Rust std library actually has some helpful SIMD types in the appropriately named std::simd module: u32x4 and u32x16. These are also available in core which is good news for no-std support. However, the interface is still experimental and feature gated (tracking issue 86656), and the portable_simd feature is not available in Rust 1.63.0 (rust-bitcoin’s MSRV). Bummer! But we can experiment with types similar to the std library (which by the way is what LDK did in their ChaCha implementation).
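For comparison, here is a minimal sketch of what those lane-wise operations look like with the nightly-only std::simd types. This is purely illustrative since it needs the portable_simd feature and a nightly compiler, which rules it out at the 1.63.0 MSRV.
#![feature(portable_simd)]
use std::simd::u32x4;
// Integer operators on the portable Simd types act on each lane independently
// and wrap on overflow, so the ARX building blocks come almost for free.
fn add_then_xor(a: u32x4, b: u32x4) -> u32x4 {
    (a + b) ^ b
}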
And boom, here is the new fancy U32x4 type which matches the std library’s simd interface for easy migration in the future. This one, though, implements just the bare minimum for ChaCha.
use core::ops::BitXor;
#[repr(C, align(16))]
#[derive(Clone, Copy, Debug, PartialEq)]
struct U32x4([u32; 4]);
impl U32x4 {
#[inline(always)]
fn wrapping_add(self, rhs: Self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[i].wrapping_add(rhs.0[i]);
});
U32x4(result)
}
#[inline(always)]
fn rotate_left(self, n: u32) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[i].rotate_left(n);
});
U32x4(result)
}
#[inline(always)]
fn rotate_elements_left<const N: u32>(self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[(i + N as usize) % 4];
});
U32x4(result)
}
#[inline(always)]
fn rotate_elements_right<const N: u32>(self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[(i + 4 - N as usize) % 4];
});
U32x4(result)
}
#[inline(always)]
fn to_le_bytes(self) -> [u8; 16] {
let mut bytes = [0u8; 16];
bytes[0..4].copy_from_slice(&self.0[0].to_le_bytes());
bytes[4..8].copy_from_slice(&self.0[1].to_le_bytes());
bytes[8..12].copy_from_slice(&self.0[2].to_le_bytes());
bytes[12..16].copy_from_slice(&self.0[3].to_le_bytes());
bytes
}
}
impl BitXor for U32x4 {
type Output = Self;
#[inline(always)]
fn bitxor(self, rhs: Self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[i] ^ rhs.0[i];
});
U32x4(result)
}
}
Custom U32x4 type.
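A quick usage sketch of my own (written as a test against the type above) showing the element rotations, since they are what will later let the diagonal rounds reuse the column round by shifting words between lanes.
#[test]
fn rotate_elements_example() {
    // Rotating the "b" row left by one moves word 5 into lane 0, lining it up
    // under word 0 for the [0,5,10,15] diagonal; rotating right undoes it.
    let b = U32x4([4, 5, 6, 7]);
    assert_eq!(b.rotate_elements_left::<1>(), U32x4([5, 6, 7, 4]));
    assert_eq!(b.rotate_elements_right::<1>(), U32x4([7, 4, 5, 6]));
}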
This type has a few SIMD relevant design choices:
- Heavy use of inline function calls to allow the compiler to recognize the code as vectorizable.
- Heavy use of for-each loops which are also easy for the compiler to recognize as vectorizable.
- The type is based on an array instead of a tuple since the heterogeneous nature of tuples can confuse the compiler into thinking it is not vectorizable.
- Explicit memory layout lines up with the 128-bit SIMD register size.
SIMD Results
The new U32x4 type is the heart of the change, but some tweaks are necessary in order for the compiler to pick up on how data flows from the top of the chacha_block function. The quarter_round now operates on 4 U32x4’s instead of 4 scalars.
// Four quarter rounds performed on the entire state of the
// cipher in a vectorized SIMD friendly fashion.
#[inline(always)]
fn quarter_round(a: U32x4, b: U32x4, c: U32x4, d: U32x4) -> (U32x4, U32x4, U32x4, U32x4) {
let a = a.wrapping_add(b);
let d = d.bitxor(a).rotate_left(16);
let c = c.wrapping_add(d);
let b = b.bitxor(c).rotate_left(12);
let a = a.wrapping_add(b);
let d = d.bitxor(a).rotate_left(8);
let c = c.wrapping_add(d);
let b = b.bitxor(c).rotate_left(7);
(a, b, c, d)
}
// Perform a round on "columns" and then "diagonals" of the state.
//
// The column quarter rounds are made up of indexes: `[0,4,8,12]`, `[1,5,9,13]`, `[2,6,10,14]`, `[3,7,11,15]`.
// The diagonals quarter rounds are made up of indexes: `[0,5,10,15]`, `[1,6,11,12]`, `[2,7,8,13]`, `[3,4,9,14]`.
//
// The underlying quarter_round function is vectorized using the
// u32x4 type in order to perform 4 quarter round functions at the same time.
// This is a little more difficult to read, but it gives the compiler
// a strong hint to use the performant SIMD instructions.
#[inline(always)]
fn double_round(a: U32x4, b: U32x4, c: U32x4, d: U32x4) -> (U32x4, U32x4, U32x4, U32x4) {
// Vectorized column quarter round.
let (a, b, c, d) = Self::quarter_round(a, b, c, d);
// Rotate the words into their diagonal positions
// in order to efficiently use the same quarter round function.
let b = b.rotate_elements_left::<1>();
let c = c.rotate_elements_left::<2>();
let d = d.rotate_elements_left::<3>();
// Vectorized diagonal quarter round.
let (a, b, c, d) = Self::quarter_round(a, b, c, d);
// Rotate the words back into their normal positions.
(
a,
b.rotate_elements_right::<1>(),
c.rotate_elements_right::<2>(),
d.rotate_elements_right::<3>(),
)
}
/// Transform the state by performing the ChaCha block function.
fn chacha_block(&mut self) {
// State is pulled out into temporary variables
// to make the function more vectorizable.
let mut a = self.a;
let mut b = self.b;
let mut c = self.c;
let mut d = self.d;
for _ in 0..10 {
(a, b, c, d) = Self::double_round(a, b, c, d);
}
// Add the initial state to the final state
self.a = a.wrapping_add(self.a);
self.b = b.wrapping_add(self.b);
self.c = c.wrapping_add(self.c);
self.d = d.wrapping_add(self.d);
}
The block and round functions tweaked to take the new type as inputs, making the vectorization possible.
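For context, the chacha_block above assumes the State struct itself has been reorganized to hold the 4x4 matrix as four U32x4 rows instead of a flat [u32; 16], roughly along these lines (my own sketch of the layout, reading the word positions off the naive version).
// The same 512-bit state, now stored as four rows so each row maps onto a SIMD register.
#[derive(Clone, Copy, Debug)]
struct State {
    a: U32x4, // words 0-3: the ChaCha constants
    b: U32x4, // words 4-7: first half of the session key
    c: U32x4, // words 8-11: second half of the session key
    d: U32x4, // words 12-15: the count followed by the 96-bit nonce
}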
Ok, with that code in place, here are the asm results.
Without the opt-level=3 or target-cpu=native flags, the code is entirely scalar. It is not equivalent to the naive version above, so maybe the compiler is able to make some other optimizations, but they are not SIMD related. With opt-level=3, there is still no SIMD usage, which I find disappointing, but I am not sure if it is a deal-breaker. With the target-cpu=native flag, the code is entirely SIMD optimized, including the quarter round. There is rampant use of the xmm* registers and SIMD instructions throughout.
.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
.p2align 4, 0x90
.type chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
.cfi_startproc
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 224
let mut a = self.a;
mov esi, dword ptr [rdi]
vmovdqu xmm2, xmmword ptr [rdi + 4]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 225
let mut b = self.b;
mov r10d, dword ptr [rdi + 32]
vmovdqu xmm1, xmmword ptr [rdi + 20]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 226
let mut c = self.c;
vmovdqu xmm0, xmmword ptr [rdi + 36]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 227
let mut d = self.d;
mov edx, dword ptr [rdi + 52]
mov ecx, dword ptr [rdi + 56]
mov eax, dword ptr [rdi + 60]
vpshufd xmm3, xmm1, 198
vpshufd xmm4, xmm2, 161
vpinsrd xmm8, xmm4, esi, 2
vpshufd xmm4, xmm0, 24
vpinsrd xmm10, xmm4, edx, 0
vpshufd xmm4, xmm0, 255
vpinsrd xmm11, xmm4, eax, 1
vpblendd xmm7, xmm3, xmm2, 8
mov r8d, 10
vmovdqa xmm3, xmmword ptr [rip + .LCPI0_0]
vmovdqa xmm4, xmmword ptr [rip + .LCPI0_1]
vmovdqa xmm5, xmmword ptr [rip + .LCPI0_2]
vmovdqa xmm6, xmmword ptr [rip + .LCPI0_3]
mov r9d, ecx
.p2align 4, 0x90
.LBB0_1:
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpshufd xmm9, xmm7, 57
vpaddd xmm8, xmm9, xmm8
vpinsrd xmm9, xmm10, r10d, 0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpshufd xmm12, xmm8, 78
vpbroadcastd xmm10, xmm10
vpblendd xmm10, xmm12, xmm10, 8
vpunpcklqdq xmm11, xmm11, xmm8
vpinsrd xmm11, xmm11, r9d, 2
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm11, xmm11, xmm3
vpshufb xmm10, xmm10, xmm3
vpxor xmm10, xmm10, xmm11
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm9, xmm10, xmm9
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpshufd xmm11, xmm9, 57
vpxor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm11, xmm7, 20
vpslld xmm7, xmm7, 12
vpor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpshufd xmm11, xmm7, 57
vpaddd xmm8, xmm11, xmm8
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm11, xmm8, xmm4
vpshufb xmm10, xmm10, xmm5
vpxor xmm10, xmm11, xmm10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm9, xmm10, xmm9
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpshufd xmm11, xmm9, 57
vpxor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm11, xmm7, 25
vpslld xmm7, xmm7, 7
vpor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm8, xmm8, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm11, xmm8, xmm3
vpshufb xmm10, xmm10, xmm6
vpxor xmm10, xmm11, xmm10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm11, xmm10, xmm9
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpxor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm9, xmm7, 20
vpslld xmm7, xmm7, 12
vpor xmm7, xmm9, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm8, xmm8, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm9, xmm8, xmm5
vpshufb xmm10, xmm10, xmm5
vpxor xmm9, xmm9, xmm10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm12, xmm9, xmm11
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpxor xmm7, xmm12, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm10, xmm7, 25
vpslld xmm7, xmm7, 7
vpor xmm7, xmm10, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
if self.start < self.end {
vmovd r10d, xmm12
vpextrd r9d, xmm9, 3
vpblendd xmm10, xmm12, xmm9, 1
vpshufd xmm11, xmm9, 233
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/cmp.rs : 1565
fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
dec r8d
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
if self.start < self.end {
jne .LBB0_1
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpextrd r8d, xmm8, 2
add r8d, esi
vpshufd xmm3, xmm8, 241
vpblendd xmm3, xmm3, xmm7, 8
vpaddd xmm2, xmm3, xmm2
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
mov dword ptr [rdi], r8d
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpbroadcastd xmm3, xmm12
vpshufd xmm4, xmm7, 198
vpblendd xmm3, xmm4, xmm3, 8
vpaddd xmm1, xmm3, xmm1
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
vmovdqu xmmword ptr [rdi + 4], xmm2
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpbroadcastq xmm2, xmm9
vpshufd xmm3, xmm12, 219
vpblendd xmm2, xmm3, xmm2, 8
vpaddd xmm0, xmm2, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 235
self.b = b.wrapping_add(self.b);
vmovdqu xmmword ptr [rdi + 20], xmm1
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vmovd esi, xmm9
add esi, edx
add r9d, ecx
vpextrd ecx, xmm9, 2
add ecx, eax
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 236
self.c = c.wrapping_add(self.c);
vmovdqu xmmword ptr [rdi + 36], xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 237
self.d = d.wrapping_add(self.d);
mov dword ptr [rdi + 52], esi
mov dword ptr [rdi + 56], r9d
mov dword ptr [rdi + 60], ecx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 238
}
ret
Fully optimized, modern compiler, asm output.
And what about the old MSRV compiler with the new code? With fully optimized asm output, the 1.63.0 version of rustc is partially using SIMD instructions with the new code. It is better than the initial naive version, but the quarter round is not using SIMD and generally the asm is using more instructions than the most optimized version of the code above. This isn’t ideal, but might be all the juice we can squeeze out of the older compilers.
That leaves just one question: is it unusual for a library to require the target-cpu flag to get SIMD instructions? Using LDK’s ChaCha20 implementation as a benchmark, it does not use SIMD instructions without at least the opt-level=3 flag. And the quarter round is also not vectorized without target-cpu. So perhaps the target-cpu flag is a necessity for now until the usage of Rust’s native SIMD types is possible and the compiler gets even smarter.