
> BLAKE is faster when using software to perform the hash

Is BLAKE3 still faster than SHA-256 when using the CPU's specialized instructions? I think most modern desktop CPUs have built-in instructions for SHA-256.

I’m guessing when people compare BLAKE 3 to SHA 256 they’re comparing software to software, but this wouldn’t be the case in reality?




I haven’t seen any benchmarks for BLAKE3 vs. the Intel/AMD SHA extensions. My guess is that Intel hardware accelerated SHA-256 will be faster than BLAKE3 running in software for most real world uses.

I can tell you this much: It is only with Ice Lake, which was released in the last year, that mainstream Intel chips finally got native high-speed SHA-NI support. Coffee Lake and Comet Lake, which are still the CPUs in a lot of new laptops being sold right now, do not support SHA-NI.


AMD Zen supports SHA extensions across all SKUs. Here are `openssl speed` numbers on an AMD EPYC 3201:

  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
  blake2s256       46720.33k   187461.21k   305314.65k   373840.55k   398207.66k   401528.15k
  blake2b512       38423.44k   155318.81k   422325.08k   592401.75k   674843.31k   681743.70k
  sha256           84620.44k   279840.47k   723573.76k  1199678.81k  1484693.50k  1510484.65k
  sha512           33854.38k   135674.20k   275343.70k   444872.36k   545802.92k   554166.95k
  sha3-256         26146.35k   103860.27k   253944.92k   308119.21k   347477.33k   351906.47k
  sha3-512         26349.83k   105590.85k   144236.03k   173082.62k   189448.19k   189814.10k
It's possible that Blake3 might be faster than accelerated SHA-256 on large inputs, where Blake3 can maximally leverage its SIMD friendliness. OTOH, Blake3 really pushes the envelope in terms of minimal security margin. Performance isn't everything. SHA-3 is so slow because NIST wanted a failsafe.

OpenSSL info:

  OpenSSL 1.1.1c  28 May 2019
  built on: Tue Aug 20 11:46:33 2019 UTC
  options:bn(64,64) rc4(8x,int) des(int) aes(partial) blowfish(ptr) 
  compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-D7S1fy/openssl-1.1.1c=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
NOTE: /proc/cpuinfo shows sha_ni detection, and the apt-get source of this version of OpenSSL confirms SHA extension support in the source code, but I didn't confirm that it was actually being used at runtime.
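
For anyone who wants to repeat the first half of that check, here is a minimal Python sketch that looks for the `sha_ni` flag in /proc/cpuinfo-style text. The function names are my own, and it only shows that the CPU advertises the extension, not that OpenSSL actually dispatches to it at runtime.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 Linux labels the line "flags"; some other architectures use "Features".
        if sep and key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags


def has_sha_ni(cpuinfo_text):
    """True if the SHA extensions flag (sha_ni) is advertised."""
    return "sha_ni" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("sha_ni advertised:", has_sha_ni(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```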


Assuming Blake3 will be across the board 43% faster (7 instead of 10 rounds) than 32-bit blake2s256, we would get:

  Blake3   SHA-256  Input size
   66743     84620  Tiny (16 bytes)
  534057   1199679  Medium (1024 bytes)
  573611   1510485  Largeish (16384 bytes)
This is based on the parent’s numbers with a fudge factor to account for Blake3 being a faster version of blake2s256 (i.e. the 32-bit version of Blake2 which is the only version in Blake3)

Of course, this does not take into account that Blake3 has tree hashing and other modes which scale better to multiple cores.

(Edit: update figures; I need to scale up Blake2s256 not Blake2b512)
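
The arithmetic behind those figures, as a small Python sketch: scale the parent's blake2s256 numbers by 10/7 (the ratio of round counts) and set them next to sha256. The dictionary and function names are mine, and the rates are in thousands of bytes per second, as `openssl speed` reports them. It assumes throughput scales inversely with round count, which is the same fudge factor as above and nothing more.

```python
# openssl speed figures from the parent comment, in thousands of bytes/s
BLAKE2S = {16: 46720.33, 1024: 373840.55, 16384: 401528.15}
SHA256 = {16: 84620.44, 1024: 1199678.81, 16384: 1510484.65}

BLAKE2S_ROUNDS = 10
BLAKE3_ROUNDS = 7


def estimate_blake3(blake2s_rate):
    """Scale a BLAKE2s rate by the round-count ratio (an assumption, not a benchmark)."""
    return blake2s_rate * BLAKE2S_ROUNDS / BLAKE3_ROUNDS


if __name__ == "__main__":
    for size in sorted(BLAKE2S):
        est = estimate_blake3(BLAKE2S[size])
        ratio = est / SHA256[size]
        print(f"{size:>6} bytes  blake3~{int(est):>7}k  "
              f"sha256={int(SHA256[size]):>8}k  ratio={ratio:.2f}")
```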


The BLAKE3 tree mode also takes advantage of SIMD parallelism on a single core, which ends up being a larger effect than the reduced number of rounds. At 2-4 KiB of input (depending on the implementation) it's 2x faster than BLAKE2s on my laptop. Where AVX2 and AVX-512 are supported, those kick in at 8 KiB and 16 KiB of input respectively, widening the difference further. The red bar chart at https://github.com/BLAKE3-team/BLAKE3 is a single-threaded measurement on a machine that supports AVX-512.


On my machine with sha extensions, blake3 is about 15% faster (single threaded in both cases) than sha256.
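
A rough way to get a single-threaded number on your own machine is a timing loop like the Python sketch below. BLAKE3 is not in Python's standard hashlib, so this only covers sha256 and blake2s; a BLAKE3 binding would have to be added separately. The function name and the 0.2 second window are arbitrary choices of mine, and for small inputs the interpreter's call overhead dominates, so only the large sizes say much about the hash itself.

```python
import hashlib
import time


def throughput_mb_s(name, size, min_seconds=0.2):
    """Hash a `size`-byte buffer repeatedly and return throughput in MB/s."""
    data = bytes(size)               # zero-filled input buffer
    hasher = getattr(hashlib, name)  # e.g. hashlib.sha256, hashlib.blake2s
    count = 0
    start = time.perf_counter()
    while True:
        hasher(data).digest()
        count += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    return count * size / elapsed / 1e6


if __name__ == "__main__":
    for size in (16, 1024, 16384, 1 << 20):
        for name in ("sha256", "blake2s"):
            print(f"{name:>8} {size:>8} bytes: {throughput_mb_s(name, size):10.1f} MB/s")
```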


Also, Blake3 has some kind of advantage in parallelizability, iirc.


yeah, blake3 multi-threaded is about 11 times faster for me than sha256 single-threaded.



