First off: sorry for spamming this thread. I am basically hoping for some input during the various stages of the investigation, since I am pretty much operating in the outermost regions of my brain capacity here (see the stupid error of misreading the VIA flags, for example). Anyway, I have now patched out the OpenSSL availability check, and this is where it gets really weird. I was very much expecting it to crash, but it doesn't (I haven't found a way to validate the accuracy yet, so maybe the results are all garbage; I tend towards them being valid though - one way to check is sketched after the output below).
So this is what happens when OpenSSL is patched to ignore the "non-availability" (the RNG still isn't used, as OpenSSL hardcodes it to disabled and I didn't bother changing that - in reality it is "detected" though):
default# /opt/bin/openssl engine -t padlock
(padlock) VIA PadLock (no-RNG, ACE)
[ available ]
default# /opt/bin/openssl speed -engine padlock aes-128-cbc
engine "padlock" set.
Doing aes-128 cbc for 3s on 16 size blocks: 757643 aes-128 cbc's in 2.87s
Doing aes-128 cbc for 3s on 64 size blocks: 197871 aes-128 cbc's in 2.87s
Doing aes-128 cbc for 3s on 256 size blocks: 50188 aes-128 cbc's in 2.86s
Doing aes-128 cbc for 3s on 1024 size blocks: 12581 aes-128 cbc's in 2.87s
Doing aes-128 cbc for 3s on 8192 size blocks: 1579 aes-128 cbc's in 2.88s
Doing aes-128 cbc for 3s on 16384 size blocks: 790 aes-128 cbc's in 2.87s
OpenSSL 1.1.1u 30 May 2023
built on: Wed Aug 16 10:40:31 2023 UTC
options:bn(64,32) md2(char) rc4(8x,mmx) des(long) aes(partial) idea(int) blowfish(ptr)
compiler: cc -fPIC -pthread -Wa,--noexecstack -O2 -O2 -g0 -fomit-frame-pointer --param max-early-inliner-iterations=2 -ffunction-sections -fdata-sections -march=esther -mtune=esther -mno-sse3 -fPIC -D_FORTIFY_SOURCE=2 -I/usr/include -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DRMD160_ASM -DAESNI_ASM -DVPAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -D_THREAD_SAFE -D_REENTRANT -DNDEBUG -I/usr/include
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128 cbc       4223.79k     4412.45k     4492.35k     4488.83k     4491.38k     4509.88k
default# /opt/bin/openssl speed -engine devcrypto aes-128-cbc
engine "devcrypto" set.
Doing aes-128 cbc for 3s on 16 size blocks: 756208 aes-128 cbc's in 2.87s
Doing aes-128 cbc for 3s on 64 size blocks: 198042 aes-128 cbc's in 2.87s
Doing aes-128 cbc for 3s on 256 size blocks: 50175 aes-128 cbc's in 2.86s
Doing aes-128 cbc for 3s on 1024 size blocks: 12580 aes-128 cbc's in 2.87s
Doing aes-128 cbc for 3s on 8192 size blocks: 1571 aes-128 cbc's in 2.86s
Doing aes-128 cbc for 3s on 16384 size blocks: 790 aes-128 cbc's in 2.88s
OpenSSL 1.1.1u 30 May 2023
built on: Wed Aug 16 10:40:31 2023 UTC
options:bn(64,32) md2(char) rc4(8x,mmx) des(long) aes(partial) idea(int) blowfish(ptr)
compiler: cc -fPIC -pthread -Wa,--noexecstack -O2 -O2 -g0 -fomit-frame-pointer --param max-early-inliner-iterations=2 -ffunction-sections -fdata-sections -march=esther -mtune=esther -mno-sse3 -fPIC -D_FORTIFY_SOURCE=2 -I/usr/include -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DRMD160_ASM -DAESNI_ASM -DVPAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -D_THREAD_SAFE -D_REENTRANT -DNDEBUG -I/usr/include
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128 cbc       4215.79k     4416.27k     4491.19k     4488.47k     4499.87k     4494.22k
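Before reading too much into those numbers: since I still haven't validated that the patched engine actually produces correct ciphertext, a quick known-answer test against the NIST SP 800-38A CBC-AES128 vector is probably the easiest check. Rough sketch, not tested on the box yet (assumes the engine is reachable as "padlock" via the usual dynamic loading, i.e. OPENSSL_ENGINES pointing at the right directory; build with cc aes_kat.c -lcrypto):

/* aes_kat.c: encrypt one NIST SP 800-38A CBC-AES128 block through the
 * padlock engine and compare against the published ciphertext. */
#include <stdio.h>
#include <string.h>
#include <openssl/engine.h>
#include <openssl/evp.h>

int main(void)
{
    static const unsigned char key[16] = {
        0x2b,0x7e,0x15,0x16,0x28,0xae,0xd2,0xa6,
        0xab,0xf7,0x15,0x88,0x09,0xcf,0x4f,0x3c };
    static const unsigned char iv[16] = {
        0x00,0x01,0x02,0x03,0x04,0x05,0x06,0x07,
        0x08,0x09,0x0a,0x0b,0x0c,0x0d,0x0e,0x0f };
    static const unsigned char pt[16] = {
        0x6b,0xc1,0xbe,0xe2,0x2e,0x40,0x9f,0x96,
        0xe9,0x3d,0x7e,0x11,0x73,0x93,0x17,0x2a };
    static const unsigned char expect[16] = {   /* CBC-AES128.Encrypt, block 1 */
        0x76,0x49,0xab,0xac,0x81,0x19,0xb2,0x46,
        0xce,0xe9,0x8e,0x9b,0x12,0xe9,0x19,0x7d };
    unsigned char ct[32];
    int outl = 0, tmpl = 0;

    ENGINE_load_builtin_engines();
    ENGINE *e = ENGINE_by_id("padlock");
    if (e == NULL || !ENGINE_init(e)) {
        fprintf(stderr, "padlock engine not available\n");
        return 1;
    }

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    /* Ask for the engine's AES-128-CBC implementation explicitly. */
    if (!EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), e, key, iv) ||
        !EVP_EncryptUpdate(ctx, ct, &outl, pt, sizeof(pt)) ||
        !EVP_EncryptFinal_ex(ctx, ct + outl, &tmpl)) {
        fprintf(stderr, "encrypt failed\n");
        return 1;
    }

    printf("first block %s the NIST vector\n",
           memcmp(ct, expect, 16) == 0 ? "matches" : "does NOT match");

    EVP_CIPHER_CTX_free(ctx);
    ENGINE_finish(e);
    ENGINE_free(e);
    return 0;
}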
Back to the numbers: the padlock and devcrypto engines are pretty much identical in performance. The main question now is why. Are they both using PadLock (devcrypto's detection routines seem to go only by PadLock being "enabled", ignoring the fact that it's supposedly not "available"), or are they both somehow falling back to software? devcrypto is a little unpredictable in that regard, as the preference between software and hardware codepaths seems to depend on the registration order, but at least for OpenSSL I don't see any obvious fallback code. Any thoughts or clever ideas on how to proceed? @rvp maybe? 😉
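One idea, given that I already managed to misread the VIA flags once: dump the Centaur CPUID leaf directly and see what the CPU actually advertises, independently of what OpenSSL or devcrypto make of it. Minimal sketch (bit positions as documented for VIA PadLock; assumes GCC's cpuid.h and an actual VIA CPU, since on other vendors leaf 0xC0000000 may return junk):

/* padlock_flags.c: print the VIA PadLock present/enabled bits from
 * CPUID leaf 0xC0000001.  Build with: cc padlock_flags.c */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int a, b, c, d;

    /* Leaf 0xC0000000 returns the highest supported Centaur leaf in EAX. */
    __cpuid(0xC0000000, a, b, c, d);
    if (a < 0xC0000001) {
        printf("no Centaur extended CPUID leaves\n");
        return 1;
    }

    /* Leaf 0xC0000001: PadLock feature flags in EDX. */
    __cpuid(0xC0000001, a, b, c, d);
    printf("RNG present:  %u\n", (d >> 2) & 1);
    printf("RNG enabled:  %u\n", (d >> 3) & 1);
    printf("ACE present:  %u\n", (d >> 6) & 1);
    printf("ACE enabled:  %u\n", (d >> 7) & 1);
    printf("ACE2 present: %u\n", (d >> 8) & 1);
    printf("ACE2 enabled: %u\n", (d >> 9) & 1);
    return 0;
}

If ACE shows up as present but not enabled here, that would at least confirm that the engines are working with the same picture of the hardware that I got from reading the flags by hand.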
Also interesting, but possibly unrelated, is the fact that NetBSD's OpenSSL build seems to be vastly superior to pkgsrc's with the minimalistic optimization options I chose:
default# /usr/bin/openssl speed -engine devcrypto aes-128-cbc
engine "devcrypto" set.
Doing aes-128 cbc for 3s on 16 size blocks: 804262 aes-128 cbc's in 2.86s
Doing aes-128 cbc for 3s on 64 size blocks: 211741 aes-128 cbc's in 2.86s
Doing aes-128 cbc for 3s on 256 size blocks: 54170 aes-128 cbc's in 2.87s
Doing aes-128 cbc for 3s on 1024 size blocks: 21468 aes-128 cbc's in 2.86s
Doing aes-128 cbc for 3s on 8192 size blocks: 2712 aes-128 cbc's in 2.87s
Doing aes-128 cbc for 3s on 16384 size blocks: 1354 aes-128 cbc's in 2.88s
OpenSSL 1.1.1k 25 Mar 2021
NetBSD 9.3
options:bn(32,32) rc4(8x,mmx) des(long) aes(partial) idea(int) blowfish(ptr)
gcc version 7.5.0 (NetBSD nb4 20200810)
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128 cbc       4499.37k     4738.26k     4831.89k     7686.44k     7741.01k     7702.76k
Blocks with a size >= 1024 show almost double the performance. I am not sure what to make of this, but I figure there is probably some clever optimization in NetBSD's build that I/pkgsrc miss.
Edit: Another interesting observation: with OPENSSL_ENGINES set to the location of the engine libraries built by pkgsrc, openvpn (which is built against the default NetBSD OpenSSL libraries) easily hits 15-16 Mbit/s. I am not sure if that's really the upper limit (my uplink is a mobile connection and quality varies widely - 30-35 Mbit/s is about the best I've ever achieved, but 20 Mbit/s should be doable fairly regularly), but that seems like quite a notable improvement over not being able to break the 12 Mbit/s barrier. CPU usage seems to be about 10-20% lower than with the default engine too. It's a wild guess, but maybe OpenSSL's synthetic benchmark is bottlenecked by something other than raw encryption performance (RAM, maybe?)?
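To test that guess outside of openssl speed, something like the following rough EVP micro-benchmark might help: it just hammers EVP_EncryptUpdate with 16 KB chunks and reports MB/s, optionally bound to a specific engine given on the command line (chunk and round counts are arbitrary; build with cc -O2 evp_bench.c -lcrypto):

/* evp_bench.c: crude AES-128-CBC throughput check via the EVP API,
 * with or without a named engine (e.g. ./evp_bench padlock). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <openssl/evp.h>
#include <openssl/engine.h>

#define CHUNK   (16 * 1024)          /* 16 KB per EVP_EncryptUpdate call */
#define ROUNDS  (4 * 1024)           /* 4096 x 16 KB = 64 MB total */

int main(int argc, char **argv)
{
    unsigned char key[16] = {0}, iv[16] = {0};
    unsigned char *in = calloc(1, CHUNK);
    unsigned char *out = malloc(CHUNK + 16);
    ENGINE *e = NULL;
    int outl;

    /* Optionally bind to a specific engine by id. */
    if (argc > 1) {
        ENGINE_load_builtin_engines();
        e = ENGINE_by_id(argv[1]);
        if (e == NULL || !ENGINE_init(e)) {
            fprintf(stderr, "engine %s not available\n", argv[1]);
            return 1;
        }
    }

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), e, key, iv);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++)
        EVP_EncryptUpdate(ctx, out, &outl, in, CHUNK);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb = (double)CHUNK * ROUNDS / (1024.0 * 1024.0);
    printf("%.1f MB in %.2f s = %.1f MB/s\n", mb, secs, mb / secs);

    EVP_CIPHER_CTX_free(ctx);
    if (e) { ENGINE_finish(e); ENGINE_free(e); }
    free(in); free(out);
    return 0;
}

If this comes out well above what openssl speed reports for large blocks, the bottleneck in the synthetic benchmark is probably not the cipher itself.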