SDL is simply not optimised for this use-case, but that doesn't mean it can't be done. If you're writing single-threaded code (which I assume BBC Basic is!), you can cut out the locking and directly poke a 640x480x3 array in memory. This part is extremely fast, as fast as your memory subsystem can go.
Then, convert that into a texture and send it to the GPU once per frame. This is the only added overhead relative to the old way of writing straight into video memory. If you picked the right format for your in-memory buffer (probably ARGB-8888, perhaps with a certain row stride) then that conversion is a no-op. The "correct" format depends on your hardware (which is why this isn't a typical workflow), but even a non-trivial pixel format conversion is fast at 640x480.
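To make that concrete, here's a rough sketch in C of the CPU side, assuming a 32-bit ARGB-8888 buffer with no row padding. The `Framebuffer` type and the `fb_*` / `pack_argb` helpers are names I made up for illustration, they are not part of SDL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical CPU-side framebuffer: one 32-bit ARGB-8888 word per pixel. */
typedef struct {
    uint32_t *pixels;
    int width;
    int height;
    int pitch; /* bytes per row; assumed here to have no padding */
} Framebuffer;

/* Allocate a zeroed buffer. Returns 1 on success, 0 on allocation failure. */
int fb_init(Framebuffer *fb, int width, int height) {
    fb->pixels = calloc((size_t)width * (size_t)height, sizeof(uint32_t));
    fb->width = width;
    fb->height = height;
    fb->pitch = width * (int)sizeof(uint32_t);
    return fb->pixels != NULL;
}

void fb_free(Framebuffer *fb) {
    free(fb->pixels);
    fb->pixels = NULL;
}

/* Pack four 8-bit channels into one ARGB-8888 word (alpha in the top byte). */
uint32_t pack_argb(uint8_t a, uint8_t r, uint8_t g, uint8_t b) {
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8) | (uint32_t)b;
}

/* Plain memory write: no lock, no copy, no format conversion. */
void fb_put_pixel(Framebuffer *fb, int x, int y, uint32_t argb) {
    assert(x >= 0 && x < fb->width && y >= 0 && y < fb->height);
    fb->pixels[(size_t)y * (size_t)fb->width + (size_t)x] = argb;
}

/* Plain memory read, the mirror image of fb_put_pixel. */
uint32_t fb_get_pixel(const Framebuffer *fb, int x, int y) {
    assert(x >= 0 && x < fb->width && y >= 0 && y < fb->height);
    return fb->pixels[(size_t)y * (size_t)fb->width + (size_t)x];
}
```

Once per frame you'd hand `fb.pixels` and `fb.pitch` to SDL, for example with SDL2's `SDL_UpdateTexture` on a texture created as `SDL_PIXELFORMAT_ARGB8888`, then render that texture. Whether that format really is the no-op one on your hardware is exactly the part I can't promise.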
If you wanted to send a 3840x2160 texture at 32 bits per pixel to your GPU at 60Hz, that requires "only" about 2 GB/s of bandwidth, which I think you'll find in most modern systems. This is pretty inefficient, which is why we don't do it, but it can be done.
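The arithmetic behind that figure, as a small C sketch. It assumes 4 bytes per pixel (ARGB-8888) and a full-frame upload every frame; `upload_bytes_per_second` is just an illustrative helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second needed to upload one full frame every frame.
 * Uses 64-bit math so 4K at 60Hz does not overflow 32 bits. */
uint64_t upload_bytes_per_second(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel, uint32_t fps) {
    assert(bytes_per_pixel > 0);
    return (uint64_t)width * height * bytes_per_pixel * fps;
}
```

With these assumptions, 3840x2160 at 60Hz comes to 1,990,656,000 bytes per second, and 640x480 at 60Hz comes to 73,728,000, so the small case is under 4% of the 4K one.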
I do think there's a lot more going on than that. Here's SDL's current pixel access code:
https://github.com/libsdl-org/SDL/blob/main/src/video/SDL_su...
To do a single pixel read there's at least a lock, a memcpy, a format conversion and an unlock.