More Power
The compiler picks a conservative target CPU by default so that the code is portable. But I just got an idea. What happens if you duplicate the function and mark the other one with target_feature and then use runtime detection to select the proper implementation? Maybe the compiler understands that one of them can be optimized more aggressively?
Kix
Feedback on the chacha20 improvement merge request got me thinking there might be some more intermediate steps to squeeze out performance before full-blown manual SIMD. I'll call the current state "custom portable". The standard library is getting its own portable SIMD support soon, but it's not quite there yet. It is portable in the sense that it only has to be written once by the developer and the compiler ports it (efficiently) to all target CPUs.
The performance benchmarks are run with RUSTFLAGS='--cfg=bench' cargo bench (or RUSTFLAGS='--cfg=bench -C target-cpu=native' cargo bench to enable the locally supported CPU features) on a cargo from the nightly channel (cargo's built-in benchmarker is only available on the nightly channel at the moment). Here are the stats on my x86_64 machine before any new tweaks.
```
test benches::chacha20_10 ... bench: 98.31 ns/iter (+/- 0.15) = 102 MB/s
test benches::chacha20_1k ... bench: 1,407.56 ns/iter (+/- 28.87) = 727 MB/s
test benches::chacha20_64k ... bench: 83,592.20 ns/iter (+/- 1,156.83) = 783 MB/s
```
Custom portable version with no target-cpu=native flag.
```
test benches::chacha20_10 ... bench: 94.56 ns/iter (+/- 3.19) = 106 MB/s
test benches::chacha20_1k ... bench: 1,309.67 ns/iter (+/- 22.95) = 782 MB/s
test benches::chacha20_64k ... bench: 78,550.31 ns/iter (+/- 646.54) = 834 MB/s
```
Custom portable version with target-cpu=native flag.
```
test benches::chacha20_10 ... bench: 82.91 ns/iter (+/- 0.38) = 121 MB/s
test benches::chacha20_1k ... bench: 1,174.07 ns/iter (+/- 2.05) = 872 MB/s
test benches::chacha20_64k ... bench: 69,780.25 ns/iter (+/- 222.24) = 939 MB/s
```
Experimental stdlib version with target-cpu=native flag.
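For reference, the --cfg=bench gate above corresponds to benchmark code along these lines. This is only a sketch: the nightly test crate and its Bencher are real, but the ChaCha20 constructor and the key/nonce shapes here are hypothetical placeholders.

```rust
// A minimal sketch of the `--cfg=bench` gating, assuming the nightly-only
// `test` crate. `ChaCha20::new` and the byte-array key/nonce are placeholders.
#![cfg_attr(bench, feature(test))] // at the crate root

#[cfg(bench)]
mod benches {
    extern crate test;
    use test::Bencher;

    use super::ChaCha20;

    #[bench]
    fn chacha20_64k(b: &mut Bencher) {
        let cipher = ChaCha20::new([0u8; 32], [0u8; 12]); // hypothetical constructor
        b.bytes = 64 * 1024; // lets the harness report the MB/s column
        b.iter(|| {
            // 1024 blocks of 64 bytes = 64 KiB of keystream.
            for block in 0..1024u32 {
                test::black_box(cipher.keystream_at_block(block));
            }
        });
    }
}
```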
I did double-check and make sure SSE2 SIMD instructions (the ones operating on the xmm registers) are used even with no target-cpu=native flag. This makes sense since the older, less efficient SSE2 instructions can be assumed on any x86_64 architecture, so they are always available to the compiler.
- x86 (32-bit) // SSE2 is optional, introduced with Pentium 4 in 2001.
- x86_64 (64-bit) // SSE2 is mandatory, part of the architecture definition. Just like NEON in the ARM AArch64 world.
Nice to see that the custom portable version makes use of them no matter what. My laptop supports the more efficient AVX2 instructions, so it can theoretically go faster! I believe that explains the minor bump with the target-cpu=native flag set, since AVX2 instructions are in fact emitted by the compiler. The goal of target-cpu=native is to blanket-enable all the features the local CPU supports (e.g. it enables things like -C target-feature=+avx2).
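A quick way to see that baseline (a sketch, not code from the crate): cfg!(target_feature = ...) reflects what the compile-time target enables, so on x86_64 it reports sse2 as on by default, and avx2 only when built with target-cpu=native (on an avx2 machine) or target-feature=+avx2.

```rust
fn main() {
    // Evaluated at compile time against the enabled target features.
    println!("sse2 enabled at compile time: {}", cfg!(target_feature = "sse2"));
    println!("avx2 enabled at compile time: {}", cfg!(target_feature = "avx2"));
}
```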
So my initial hope to improve this, based on Kix's ideas above, is to add an artificial branch for the compiler. One branch just goes to the custom portable version as it is today, while the other is marked for the x86_64 architecture only, using the target_arch check. The thinking is that the compiler can optimize the x86_64 version further since it knows it will only use x86_64 instructions, even without the target-cpu=native flag set. This would make it more performant without limiting portability (binaries are generally shipped per-architecture).
Easy enough to just fire it up and see what happens.
```rust
impl State {
    // Add x86_64-specific version.
    #[cfg(target_arch = "x86_64")]
    #[inline(always)]
    fn chacha_block_x86_64(&mut self) {
        let mut working_state = self.matrix;
        for _ in 0..10 {
            working_state = Self::double_round(working_state);
        }
        // Add the working state to the original state.
        (0..4).for_each(|i| {
            self.matrix[i] = working_state[i].wrapping_add(self.matrix[i]);
        });
    }
}
```
```rust
impl ChaCha20 {
    // Update keystream_at_block to branch and use the architecture-specific version if it can.
    #[inline(always)]
    fn keystream_at_block(&self, block: u32) -> [u8; 64] {
        let mut state = State::new(self.key, self.nonce, block);
        #[cfg(target_arch = "x86_64")]
        {
            state.chacha_block_x86_64();
        }
        #[cfg(not(target_arch = "x86_64"))]
        {
            state.chacha_block();
        }
        state.keystream()
    }
}
```
Just checking if explicit compile-time architecture checks help, but no performance impact.
Aaaaand, no change. I guess it would have been surprising if this had an impact, because the compiler is already optimizing a lot of things under the hood for an architecture. The target_arch is implicitly set by the compilation target, which defaults to the local machine. So the takeaway: it looks like these explicit branches don't add enough info to push the compiler to emit the more performant AVX2 SIMD instructions (vmovdqa, vpaddd, vpxor, vpshufd, vpshufb, vprold) unless the target-cpu=native flag is set, enabling the avx2 feature.
But this was just at the high-level target_arch which is already implicitly known. Does it help to drop further down to a specific SIMD feature of the CPU like avx2? Features like target_feature = "avx2" need to be explicitly enabled at compile time, which is exactly what the target-cpu=native flag is doing. This creates a less portable executable because it requires avx2 to run. We want the best of both worlds: a portable executable which can still take advantage of specific CPU features.
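For contrast, the purely compile-time version of that trade-off looks something like this sketch (fast_path is an illustrative name, not something in the crate): the avx2 variant only exists in binaries built with the feature enabled, which is exactly the portability loss described above.

```rust
// Only compiled in when built with -C target-feature=+avx2 (or target-cpu=native
// on an avx2-capable machine); the resulting binary requires avx2 to run.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn fast_path() {
    // avx2-capable code path
}

// Portable fallback for every other build.
#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
fn fast_path() {
    // portable code path
}
```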
Runtime Feature Detection
We want to conditionally use some super-optimized code, but the conditional check must come at runtime in order to avoid all the granular, non-portable binaries. Runtime feature checks are not as clean-cut as their compile-time cousins. Checking if a feature is available isn't free, and could be a performance hit if it's in a super hot section of code (very possible in cryptography). It also introduces extra branches in the code, a larger executable, more (and more difficult) things to test, and it may even throw off CPU branch prediction optimizations.
With that said, how does one go about attempting this? The #[cfg(...)] compile-time checks used above, configuration predicates, check whether a certain configuration is set. Rust has another tool, feature enabling attributes, which manually flip on a feature for a block of code: #[target_feature(enable = "...")]. This isn't magic; the code must be marked as unsafe since the compiler is giving up responsibility for checking whether it will actually run. It is on the developer now to add some runtime checks to ensure this code is only hit when run on machines which support it. The standard library offers some help here; I am going to stick to trying to get avx2 instructions emitted for now.
```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn avx2_function() {
    // ...
}
```
Save users from confusing compile-time errors by wrapping feature enabling attributes in configuration predicates for the architecture.
OK, so updating the compile-time tweaks from above to use feature enabling attributes and runtime detection instead.
```rust
impl State {
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    #[inline]
    unsafe fn chacha_block_avx2(&mut self) {
        let mut working_state = self.matrix;
        for _ in 0..10 {
            working_state = Self::double_round(working_state);
        }
        // Add the working state to the original state.
        (0..4).for_each(|i| {
            self.matrix[i] = working_state[i].wrapping_add(self.matrix[i]);
        });
    }
}
```
```rust
impl ChaCha20 {
    #[inline(always)]
    fn keystream_at_block(&self, block: u32) -> [u8; 64] {
        let mut state = State::new(self.key, self.nonce, block);
        #[cfg(target_arch = "x86_64")]
        {
            // Runtime feature detection with explicit branching.
            if is_x86_feature_detected!("avx2") {
                unsafe {
                    state.chacha_block_avx2();
                    return state.keystream();
                }
            }
        }
        state.chacha_block();
        state.keystream()
    }
}
```
Feature enabling attributes with runtime feature detection dispatching.
Some things to know here. #[inline(always)] cannot be combined with the feature enabling attribute. I don't know the exact scenarios that could be troublesome, but a function may use instructions that the surrounding code's feature set doesn't include. Inlining would mix these incompatible instruction sets, so the compiler just avoids it altogether. I think I'll have to iterate on the best way to organize things if this strategy proves worthwhile. Same goes for this unsafe usage. Since unsafe code loses a lot of the compiler's memory safety help, it needs to be used in as small a scope as possible, but that seems at odds with the goal here.
Now what is cool is that this does in fact emit avx2 instructions without manually enabling the feature at compile time! No -C target-feature=+avx2 or target-cpu=native. But check out these performance numbers.
```
test benches::chacha20_10 ... bench: 106.44 ns/iter (+/- 5.25) = 94 MB/s
test benches::chacha20_1k ... bench: 1,584.45 ns/iter (+/- 135.14) = 646 MB/s
test benches::chacha20_64k ... bench: 88,106.15 ns/iter (+/- 2,732.53) = 743 MB/s
```
Naive runtime checking is a performance hit!
Ouch, a step backwards. I think two things are going on here. The first is that is_x86_feature_detected!("avx2") isn't free; there is some overhead. And second, while avx2 instructions are being emitted, there are not that many of them. Although the feature is enabled on the top-level function of a bunch of work, it doesn't appear to be applied to the internal calls.
But this is pretty cool. Now, how can this be captured while keeping some requirements in mind?
- Share as much of the portable logic as possible. Copying and pasting every function for every feature would be a maintenance nightmare.
- Feature enabling paths need to be tested along with their portable counterparts (a sketch of one approach follows this list).
- The portable SIMD u32x4 type is coming to the standard library, so keep the custom version useful and compatible in order to take advantage of it.
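For the testing requirement, one low-effort approach is an equality test between the two paths, skipped on machines without the feature. A sketch, assuming the State methods from above and treating the key and nonce as plain byte arrays (the real types may be wrappers):

```rust
#[cfg(all(test, target_arch = "x86_64"))]
mod dispatch_tests {
    use super::State;

    #[test]
    fn avx2_block_matches_portable_block() {
        if !is_x86_feature_detected!("avx2") {
            return; // nothing to compare on this machine
        }
        let (key, nonce) = ([0u8; 32], [0u8; 12]); // assumed plain byte arrays
        let mut fast = State::new(key, nonce, 0);
        let mut portable = State::new(key, nonce, 0);
        // SAFETY: guarded by the runtime avx2 check above.
        unsafe { fast.chacha_block_avx2() };
        portable.chacha_block();
        assert_eq!(fast.keystream(), portable.keystream());
    }
}
```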
Organization
I want to avoid introducing SIMD intrinsics right away, the stuff which looks like _mm_set_epi32 where you are writing the SIMD by hand. I think the code can first be refactored to let the compiler dip into the more performant instructions, and perhaps then chunks could be rewritten by hand with intrinsics for super optimizations.
The code currently has three layers of abstraction. At the bottom there is the custom portable U32x4 type with the bare-minimum supported operations (e.g. wrapping add). Next up is the internal State which has the chacha-specific operations like the quarter round. And at the top is the external ChaCha20 interface which also holds the secret material of the session.
I want to keep the U32x4 layer and interface no matter what, so that it can be swapped with the experimental standard library version when ready. This will bring a lot of built-in, easy-to-maintain optimizations to the worst-case scenario (portable). As far as SIMD-able logic goes, I believe most of it is wrapped up in the U32x4 type and the State. The top-level ChaCha20 does do some work applying a keystream, but it is not a particularly "hot" section of code. I believe I want to still keep the U32x4 and State interfaces safe even with optimizations. The unsafe runtime dispatching will just be an implementation detail. My hope is that this keeps the unsafe code scoped as small as possible, but hopefully crossing function boundaries isn't a performance hit.
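One shape this could take is roughly the following sketch (the dispatching method name is made up; it reuses the chacha_block_avx2 and portable chacha_block from above): the State interface stays safe, and the detection plus the unsafe call live inside it.

```rust
impl State {
    // Safe wrapper; the runtime detection and the unsafe call are an
    // implementation detail hidden from callers. Name is illustrative.
    #[inline(always)]
    fn chacha_block_dispatch(&mut self) {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // SAFETY: only reached when the running CPU reports avx2 support.
                unsafe { self.chacha_block_avx2() };
                return;
            }
        }
        self.chacha_block();
    }
}
```

The detection call is still sitting in a hot path here, which is what the caching question below tries to address.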
Register Spilling
Caching Feature Detection
In any case, one question is how to cache the feature detection calls, like is_x86_feature_detected! and is_aarch64_feature_detected! (the ARM version was added in Rust 1.59.0, so it just makes the MSRV cut), which apparently are not that cheap. But why does the standard library not cache them itself?
```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Once;

// Cached feature bitmap, filled in on first use.
static CPU_FEATURES: AtomicU32 = AtomicU32::new(0);
static INIT: Once = Once::new();

#[cfg(target_arch = "x86_64")]
fn get_feature(feature_bit: u32) -> bool {
    // Fast path: sse2 (bit 0) is always present on x86_64, so a non-zero
    // value doubles as an "already detected" flag.
    if CPU_FEATURES.load(Ordering::Relaxed) == 0 {
        INIT.call_once(|| {
            let mut features = 0;
            if is_x86_feature_detected!("sse2") { features |= 1 << 0; }
            if is_x86_feature_detected!("avx2") { features |= 1 << 1; }
            // etc.
            CPU_FEATURES.store(features, Ordering::Relaxed);
        });
    }
    (CPU_FEATURES.load(Ordering::Relaxed) & feature_bit) != 0
}
```
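A hypothetical call site for that cache, reusing the State methods from earlier (the constant and function names here are made up): with the cache in place, the dispatch from the Organization section would only pay an atomic load per call instead of the full detection.

```rust
// Must match the bit assignment in the initialization above (avx2 is bit 1).
const FEATURE_AVX2: u32 = 1 << 1;

#[cfg(target_arch = "x86_64")]
fn chacha_block_dispatch_cached(state: &mut State) {
    if get_feature(FEATURE_AVX2) {
        // SAFETY: the cached bit is only set when the CPU reported avx2 support.
        unsafe { state.chacha_block_avx2() };
    } else {
        state.chacha_block();
    }
}
```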
Which SIMDs?
Also, what features are worth detecting? Maybe also RISC-V Vector Extension (RVV) and/or WebAssembly SIMD128?
| Architecture | Extension | Status | Introduction Date | Current Prevalence |
|--------------|-----------------|-----------|---------------------------|-----------------------------------|
| x86_64 | SSE2 | MANDATORY | 2001 (Pentium 4) | 100% of x86_64 |
| x86_64 | SSE3 | Optional | 2004 (Pentium 4) | >99% of CPUs from 2004+ |
| x86_64 | SSSE3 | Optional | 2006 (Core 2) | >99% of CPUs from 2006+ |
| x86_64 | SSE4.1/4.2 | Optional | 2008 (Core i7) | >95% of CPUs from 2008+ |
| x86_64 | AVX | Optional | 2011 (Sandy Bridge) | ~90% of CPUs from 2011+ |
| x86_64 | AVX2 | Optional | 2013 (Haswell) | ~80% of CPUs from 2013+ |
| x86_64 | AVX-512 | Optional | 2016 (Xeon Phi) | <20% of CPUs from 2016+ |
| ARM32 | NEON (ARMv7) | Optional | 2009 | ~95% of ARMv7 CPUs |
| ARM64 | NEON | MANDATORY | 2011 (ARMv8) | 100% of ARM64 |
| ARM64 | Crypto Exts | Optional | 2011 (ARMv8) | ~90% of ARM64 |
| ARM64 | Dot Product | Optional | 2016 (ARMv8.2) | ~60% of ARM64 from 2017+ |
| ARM64 | SVE | Optional | 2016 | <10% of ARM64 (mostly HPC) |
| ARM64 | SVE2 | Optional | 2020 (ARMv9) | <5% of ARM64 (newest) |
Rough idea of SIMD features across the Intel/AMD and ARM worlds.
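For a rough sense of how dispatch across that table could look with the macros std already provides, here is a sketch (illustrative only; RVV and wasm simd128 detection are not covered, and wasm simd128 is a compile-time feature anyway):

```rust
// Report the best SIMD tier the running CPU offers; names are illustrative.
fn best_simd() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            "avx2"
        } else {
            "sse2" // mandatory baseline on x86_64
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is mandatory on aarch64; the macro (stable since Rust 1.59)
        // is for optional extensions such as the crypto instructions.
        if std::arch::is_aarch64_feature_detected!("aes") {
            "neon+crypto"
        } else {
            "neon"
        }
    }
    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    {
        "portable"
    }
}
```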
Testing
Can QEMU (Quick EMUlator) be used to logically (not performance) test code built for different architectures?