Correct. Apple Silicon GPU performance should be just as fast in llamafile as it is in llama.cpp. Where llamafile currently lags is CPU inference, and only on Apple Silicon specifically, which runs ~22% slower than a native build of llama.cpp. I suspect it's because either (1) I haven't implemented support for Apple Accelerate yet, or (2) our GCC -march=armv8-a toolchain isn't as good at optimizing ggml-quant.c as Xcode's clang -march=native is. I hope it's an issue we can figure out soon!
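For context on hypothesis (1): when llama.cpp is built with Accelerate enabled, it can hand large float matrix multiplications to Apple's tuned BLAS instead of ggml's portable loops, which is where a chunk of the CPU-side speedup on Apple Silicon tends to come from. Here's a minimal, self-contained sketch of the kind of call involved (cblas_sgemm via the Accelerate framework); it's just an illustration of what "Accelerate support" means, not llamafile's actual integration, and the exact build flag llama.cpp uses to gate this is an assumption on my part.

```c
// Sketch: routing a single-precision GEMM through Apple's Accelerate BLAS.
// Build on macOS with:  clang -O2 accel_sgemm.c -framework Accelerate -o accel_sgemm
#include <stdio.h>
#include <Accelerate/Accelerate.h>  // exposes the CBLAS interface (cblas_sgemm)

int main(void) {
    // C = A * B for small row-major float matrices.
    enum { M = 2, K = 3, N = 2 };
    float A[M * K] = {1, 2, 3,
                      4, 5, 6};
    float B[K * N] = { 7,  8,
                       9, 10,
                      11, 12};
    float C[M * N] = {0};

    // C = 1.0 * A * B + 0.0 * C, leading dimensions match row-major layout.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,
                      B, N,
                0.0f, C, N);

    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++)
            printf("%8.1f", C[i * N + j]);
        printf("\n");
    }
    return 0;
}
```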