I wrote the BootROM and 2nd stage loader used in VC3 and VC4 - way back in ~2005. And the interrupt controller, VPU sync block and various other things used in this open FW.
It's pretty awesome to see how much context the community has reverse engineered from the design - I remember making the first BCM2835 based project (the Roku 2) and spending _A LOT_ of time getting the ARM to start up reliably and boot into Linux as fast as possible. I've spotted 2 concepts in the open FW so far that misunderstand the implementation of the HW but still work, which is fun!
One advantage of having the GPU start first (maybe the only advantage :) is that it can play a video for a splash screen instead of a static image. If you've ever used a Roku 2/3 or newer, this is a feature I hacked together for a demo to hide boot-up latency - now it's a standard part of RokuOS (and is quite hard to replicate on traditional ARM/MIPS SoCs).
I'm guessing you had privileged access to the datasheets for those chips and you're not legally allowed to tell them what they misunderstood? I imagine I would feel super weird seeing something being done wrong but not being allowed to tell someone. I mean I understand why, but it'd be frustrating.
Well, I wrote the RTL as well as implemented the SW so I guess I do have proprietary information on the topic...
In theory, all the code and sequences for the BootROM, 2nd stage, ARM loader and peripheral control functions are available by disassembling the binaries (and extracting some code from the ROM with a memcpy) - it just requires reverse engineering / re-implementing it, and this is part of the fun / why this project is so great.
I might suggest they poke at the linux port and u-boot of the BCM2835 from Roku however :)
Been on the other side of that working on the HTC Linux project and chatting with Android developers at Google. I knew they were laughing to themselves as we tried to figure out what were modem differences and what were SoC differences between MSM7x00A and non-A.
The Raspberry Pi is really a VideoCore IV processor with an ARM bolted on the side. At power on, a boot ROM loads an embedded OS image into the VC4 processor. That then sets up the hardware, powers up the ARM, and loads the OS image into that. The VC4 then continues running in the background servicing RPCs from the ARM.
What this is is a replacement VC4 operating system image which just fires up the ARM with an embedded image, sets it running, and then halts (assuming I'm understanding the docs correctly); so it can't (yet) actually load an OS into the ARM from disk or respond to RPCs etc.
In practical terms, this is the biggest, most difficult step forward towards running an almost completely open system on the RPi. (It could also be used as the core of a proper operating system running on the VC4 itself, which could be rather interesting.)
In practical terms, this seems to be quite a long way from running a completely open system - the VC4 is responsible for initializing a whole bunch of critical and I believe undocumented hardware that's required to enable any of the peripherals, and this doesn't. Basically, a whole bunch of stuff that'd be implemented by drivers on (say) most Allwinner SoCs I've looked at instead relies on the blob. This is probably part of the reason the Raspberry Pi has relatively good mainline kernel support - they don't have to deal with months of LKML hell to get basic things like clocking and power management working every time they release a new chip because that's all handled in the blob.
I've written some bare-metal stuff on the Pi (somewhere I've got a 90% finished port of Fuzix for it), and oh god yes, that's so true. Being able to do complex stuff like set up a framebuffer by making a simple RPC call to the VC4, passing in a pointer and a descriptor block saying 'this big, please', is so much better than having to do it yourself.
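For a flavour of how simple that RPC is, here's a hedged C sketch of the old channel-1 mailbox framebuffer descriptor. The field layout follows the publicly documented raspberrypi/firmware mailbox framebuffer interface; the struct and function names are my own, not from this project's sources:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the descriptor block passed to the VC4 over the mailbox
   (old channel-1 framebuffer interface). The ARM fills in the request
   fields; the VC4 writes back pitch, pointer and size. */
struct fb_info {
    uint32_t width;          /* requested physical width  */
    uint32_t height;         /* requested physical height */
    uint32_t virt_width;     /* virtual framebuffer width  */
    uint32_t virt_height;    /* virtual framebuffer height */
    uint32_t pitch;          /* filled in by the VC4: bytes per row */
    uint32_t depth;          /* requested bits per pixel */
    uint32_t x_offset;
    uint32_t y_offset;
    uint32_t pointer;        /* filled in by the VC4: framebuffer address */
    uint32_t size;           /* filled in by the VC4: framebuffer size */
};

/* Build a 'this big, please' request; the caller would then write the
   (16-byte-aligned) address of this struct to the mailbox register. */
static struct fb_info fb_request(uint32_t w, uint32_t h, uint32_t bpp) {
    struct fb_info fb = {0};
    fb.width = fb.virt_width = w;
    fb.height = fb.virt_height = h;
    fb.depth = bpp;
    return fb;
}
```

On real hardware the zeroed fields come back filled in by the VC4, and the ARM just starts drawing into the returned pointer.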
This firmware doesn't contain code to initialise the GPU, but it does contain initialisation code for one rather important peripheral: the DRAM. The VC4 has been pretty well reverse engineered for a while (I did one of the early C compiler ports for it), but the big missing piece was figuring out how to initialise the RAM. Without that, everything had to fit in the 128kB of built-in SRAM.
> The Raspberry Pi is really a VideoCore IV processor with an ARM bolted on the side.
I knew that the bootloader worked that way but didn't think of it being that way until reading your comment. Interesting to think of the hierarchy that way.
Naively this makes me think that the ARM ISA could use something like UEFI (or BIOS) to unify the devices' bootstrapping logic around some common metaphor.
Well, the ARM is literally powered down until the VC4 turns it on.
(I've heard that the only reason the ARM is there at all is that Broadcom had a royalty-free license and there was some spare silicon, so they thought, hey, why not, it might be useful...)
Traditionally ARMs have never used any kind of common boot process because they come out of the embedded world where every system is bespoke --- the focus is always on individual products, rather than building a platform.
There have been various efforts to fix this but AFAIAA they've never come to much. I think the current one is DeviceTree, but I don't believe there's much vendor support.
The VC4 was intended for use in set-top-boxes or TVs, where the ARM would run a regular lightweight OS to drive the UI menus and the VC4 would do the video decoding, and compose the UI onto the video. It's not an accident, but it is a second-class citizen.
> Traditionally ARMs have never used any kind of common boot process because they come out of the embedded world where every system is bespoke --- the focus is always on individual products, rather than building a platform.
This is something I did not know. In the mobile world the boot happens in the UEFI mould right? So when someone licenses from ARM they get to design their own boot sequence?
Mobile doesn't have UEFI either. Usually what happens is the processor has some minimal boot ROM that can initialise the Flash (and enforce signed boot!), and then it just loads a chunk of Flash into RAM and jumps to it.
Well, it's two different use cases. PCs are platforms, and they have common components and optional peripherals. Embedded systems are completely custom designs that rarely have common components. Everything is driven by BOM cost, and if you don't need something, you don't add it.
Booting such a system is highly dependent upon the board configuration and the peripherals that are needed. Furthermore, vendors have different boot-up requirements. Some boot loaders perform proprietary system checks before loading the RTOS. Some boot loaders need to be able to upgrade firmware from an image on-chip, in flash, or fed into it over SPI or a UART. Some boot loaders are little more than a jump table decoder that just jumps to the currently active firmware image on the chip. Creating a universal boot loader with enough flexibility for each of these situations is harder than just implementing a boot loader for a particular electronics product or product family.
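As a hedged illustration of the "jump table decoder" style of boot loader mentioned above (the header layout and all names here are invented for the sketch - real boot ROMs each have their own, usually with a checksum or signature check), the whole job can be little more than picking the newest valid image slot and jumping to it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented image header, for illustration only. */
struct image_hdr {
    uint32_t magic;      /* marks the slot as programmed */
    uint32_t version;    /* increases on each firmware update */
    uint32_t entry;      /* address to jump to */
};

#define IMAGE_MAGIC 0xB007B007u

/* Pick the valid slot with the highest version; returns NULL if
   neither slot contains a bootable image. */
static const struct image_hdr *pick_slot(const struct image_hdr *a,
                                         const struct image_hdr *b) {
    const struct image_hdr *best = NULL;
    if (a && a->magic == IMAGE_MAGIC)
        best = a;
    if (b && b->magic == IMAGE_MAGIC && (!best || b->version > best->version))
        best = b;
    return best;
}

/* On real hardware the loader would then do something like:
 *   void (*entry)(void) = (void (*)(void))(uintptr_t)best->entry;
 *   entry();
 */
```

Everything beyond this - proprietary system checks, firmware upgrade over SPI or a UART - is what makes each vendor's loader bespoke.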
There's a bit of a trend to embed a tiny ROM in the processor itself that's got just enough intelligence to bootstrap an image off various peripherals.
The Broadcom parts work like this for the VC4; the boot ROM can talk to an SD card, parse a FAT filesystem, and load and run the second-stage loader (bootcode.bin). But Allwinner chips do this too, and they're even smart enough to try several different media types (including SATA, IIRC!). Ditto Tegra parts.
I imagine that if you're a major customer you get to choose the contents of the boot ROM.
From a hacker perspective, it's brilliant, because the devices are completely unbrickable. It doesn't matter how broken they get, you can't change the ROM, which means that you always have the ability to get something working. But, of course, they all have entirely different APIs. The VC4 just dumps an image into memory, sets a couple of registers, and jumps to it, and I imagine the others all do the same sort of thing.
The only upside of this is unbrickability. Having the boot ROM inside the CPU means you are stuck forever with whatever bugs there are in it. Been there.
I don't think such a thing exists. Aren't dies made as small as possible to increase yield? I know the Cell processor in the PS3 had a spare SPE core to significantly increase yield.
The ARM world is rapidly moving towards UEFI, partly thanks to Microsoft insisting there be something like it for them to build on.
Even better, u-boot implements enough of it to boot a UEFI operating system. So on some new devices you'll get UEFI firmware, on everything else you can flash u-boot with UEFI support... and it's now part of their default configs.
See your mobile phone: it depends whether the manufacturer makes that an option. In a processor I'm familiar with (imx53), once it's turned on by a processor fuse it stays on - it's then up to the second stage boot loader (provided by the device OEM) to decide whether to load untrusted images.
Linux (and probably *BSD, not sure if they support it) can use devicetree files to configure itself for the hardware it's running on. No runtime firmware necessary.
Everyone says this setup where the VC4 boots the ARM is weird, but isn't that pretty similar to how Intel ME boots the x86, right down to the proprietary firmware and everything?
It doesn't seem like an unusual setup to me at all: the only unusual thing, I guess, is that the VC4 is capable of a lot more than the small ARC/ARM core inside Intel ME.
The main weirdness is that the VC4 is more of a GPU than a secondary CPU. It'd be like an Nvidia graphics card bootstrapping the actual CPU in an ordinary desktop.
That's not quite right. The Pi has a completely separate GPU unit. The VC4 is different, and is a pretty normal processor, although it's got a pile of DSP addons.
- 32 registers which can hold either integers or 32-bit floats;
- ARM-like 32-bit 3op instructions, with a limited set of 16-bit 2op instructions;
- integrated 32-bit floating point instructions, using the same registers as everything else;
- some 48-bit instruction forms (allowing a full 32-bit payload! No ghastly ARM constant pools or PowerPC-style split payloads!);
- two cores, with integrated interrupt handler;
- integrated DSP with 80-bit vector instructions working on a 64x64x8bit vector working area, which TBH looks like it was stolen from another processor completely and which I frankly don't understand;
- ARM-style conditional execution (for some instructions);
- ARM-style multiregister loads and saves (and pushes and pops);
- DSP-style saturated arithmetic and fixed-point support;
- no ALU-setting-condition-flags instructions, or add-with-carry or subtract-with-carry operations, which makes 64-bit arithmetic really painful (if you need cmp to set the condition flags, what's the carry flag even for?)
It's actually a really nice thing to write assembly for. It's orthogonal enough to be understandable, while quirky enough to allow some really satisfying optimisations. e.g. there's the addcmpb instruction, which will add a value to a register, compare it with another register, and branch based on a condition code, all in a single 32-bit operation --- it's basically a loop in a box.
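To illustrate what addcmpb folds together (a sketch based only on the description above, not on real VC4 assembly), the equivalent loop control in C is three separate steps:

```c
#include <assert.h>

/* addcmpb fuses the add, compare and branch that end a typical loop.
   Spelled out in C, that is the last two lines of this do/while: */
int sum_strided(const int *buf, int limit, int step) {
    int total = 0;
    int i = 0;
    do {
        total += buf[i];
        i += step;          /* add: advance the induction variable */
    } while (i < limit);    /* cmp + b: compare and branch back */
    return total;
}
```

A compiler targeting the VC4 could, in principle, emit the add, compare and backward branch as the single instruction - hence "a loop in a box".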
It's also the only time I've ever seen 6-bit floating point constants...
It was my first job out of University to design this instruction set which may explain some of the quirkiness...
Initially the instructions did all set the status flags but it caused a tight feedback loop in the processor. The choice was between a higher clock frequency for all instructions or better 64-bit arithmetic.
None of the initial video applications needed 64-bit support so it lost out, although I did get to put in the divide instruction just so my Doom port could run faster :)
Are you allowed to tell us what the C compiler used internally was based on? I know there are some very easy to port proprietary compilers which commonly don't see the outside world, and I'm wondering whether it was one of those, or whether some poor sucker had to port gcc.
We paid a company called Metaware to make a compiler. I believe this compiler is still in use.
As it happens, while we were waiting for this compiler to be made for us, I ported GCC to the architecture for my own use. I don't remember it being all that painful, just a few pages of machine description and everything seemed to work fine.
This only supported the scalar instruction set. However, when we needed an MP3 decoder I found that it really needed 32bit precision to meet the audio accuracy, so I also made a different port of gcc that targeted the vector processor. I changed the float data type so any mention of float actually represented 16 lanes of a 16.16 fixed point data type implemented on the vector processor. From what I recall, mp3 decode required 2MHz of the processor for stereo 44.1kHz.
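A hedged sketch of the 16.16 fixed-point arithmetic described above (the representation itself is standard; the helper names are mine - on the VPU each such value would occupy one of the 16 vector lanes):

```c
#include <assert.h>
#include <stdint.h>

/* Signed 16.16 fixed point: 16 integer bits, 16 fraction bits. */
typedef int32_t fx16_16;

static fx16_16 fx_from_double(double d) { return (fx16_16)(d * 65536.0); }
static double  fx_to_double(fx16_16 x)  { return x / 65536.0; }

/* Addition works directly on the underlying integers. */
static fx16_16 fx_add(fx16_16 a, fx16_16 b) { return a + b; }

/* Multiply via a 64-bit intermediate, then shift back down so the
   result stays in 16.16 format. */
static fx16_16 fx_mul(fx16_16 a, fx16_16 b) {
    return (fx16_16)(((int64_t)a * b) >> 16);
}
```

This gives the full 32-bit precision per lane that float16 couldn't, which is presumably why it met the MP3 accuracy requirement.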
Agreed that this is odd. But that's just a difference in the capabilities of the bootstrap core: Intel uses a less powerful ARC or ARM core, while Broadcom uses a more powerful core that also functions as a GPU. The mechanism is the same.
There isn't a fundamental difference in how Intel CPUs start up and how the Raspberry Pi starts up, as I sometimes see people implying.
I read through the wiki page for VideoCore but still did not get a complete picture of what this is. Is it a GPU? If it's a GPU or something that specializes in processing video/audio, why is it in charge of the boot process? Doesn't it usually happen the other way around: the boot ROM loads the host OS, which in turn initializes the other firmware?
It's a vector processor and a GPU. The vector processor is intended for video decoding and therefore has a very big SIMD capability. It has some nice gimmicks like the 'square' register file: you can access rows or columns for SIMD operations. The VPU is I believe what boots first and normally runs an RTOS called "ThreadX".
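A toy C model of that "square" register file idea (dimensions taken from the 64x64x8bit description elsewhere in the thread; the access details are simplified and the function names invented):

```c
#include <assert.h>
#include <stdint.h>

#define VRF_DIM 64

/* Toy model: the vector register file as a 64x64 square of bytes. */
static uint8_t vrf[VRF_DIM][VRF_DIM];

/* Read a 16-lane vector along a row... */
static void read_row(int row, int col, uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = vrf[row][(col + i) % VRF_DIM];
}

/* ...or along a column: the same data, sliced the other way. */
static void read_col(int row, int col, uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = vrf[(row + i) % VRF_DIM][col];
}
```

Being able to slice the same storage both ways is handy for video work, e.g. doing the horizontal and vertical passes of a 2-D transform without an explicit transpose.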
In particular we are talking about "Scalar/Vector (VPU)" "Dual Core VideoCore IV® Multimedia Co-Processor" (Figure 3B). As far as I understand, that's the thing that boots the device and usually runs ThreadX OS in the closed-source bootloader. The VPU is a dual core processor with scalar and integer vector instructions.
The VPU is more-or-less independent of the GPU, although I'm not sure whether the VPU gets involved with some GPU scheduling tasks.
The GPU has four "QPU" QuadProcessor pipelines, which from memory can do floating-point vector processing.
Kinda, but with x86 the BIOS code is executed on the x86 processor itself and before that, the processor bootstraps itself. Here, the VideoCore processor (the GPU iiuc) is responsible for bootstrapping the CPU.
It replaces the non-free binary blob on the Raspberry Pi that runs right after reset and initializes the processor. Modern complex processors require a specific and carefully orchestrated series of initialization steps to configure system clocks and ensure external RAM and other peripherals work correctly.
Of course no one in their right mind would use this reverse-engineered code here for any serious purpose. Instead, you dump the bastardized RPi platform (it's basically a video graphics card with an ARM bolted on) and use something actually free, like the BeagleBone.
>and use something actually free, like the BeagleBone.
Oh please, while you lose the VC4 firmware, you instead get a proprietary PowerVR GPU with no free drivers. Not to mention that the Beaglebone has a much worse CPU than the "bolted on ARM" of the RPi.
If you're going to dump the rpi for this reason, going to something like the imx6 used in the Cubieboard and Novena would make a lot more sense (edit: Cubox, not Cubieboard).
> If you're going to dump the rpi for this reason, going to something like the imx6 used in the Cubieboard and Novena would make a lot more sense.
The Cubieboard uses an Allwinner SoC, which has a bad reputation among open source people since Allwinner violated the GPL multiple times in the past (though I heard the Linux support is decent, mainly because of the work of the Sunxi community: https://linux-sunxi.org/Main_Page). Novena (and "Sabre Lite - i.MX6 Quad Core"; cf. https://news.ycombinator.com/item?id=12743798) indeed use a Freescale (now bought by NXP) SoC.
Unfortunately NXP is nuking most of the iMX6 support from orbit after they bought Freescale. Even things like most of the SD card images for their SABRE boards were deleted from their site. It's only due to the large open ecosystem around the iMX6 that we were able to debug some issues while bringing up a board recently.
Same thing happened with Allwinner's SOCs, when the one point of contact that was providing the source of their modified kernel left, Allwinner chipsets stopped getting rapid mainline kernel support.
The silver lining though is that the sunxi community has been working to mainline newer chips like the Allwinner H3 & has been doing a bang up job with minimal info, newer kernels can boot up & get most hardware onboard configured & ready to use.
Intel GPUs require neither a blob nor reverse engineering, so ironically the Minnowboard (http://wiki.minnowboard.org/MinnowBoard_Wiki_Home; current revision is "Minnowboard Turbot") is very open (also the UEFI implementation (TianoCore) is open source; just a small FSP (Firmware Support Package) by Intel containing binary data has to be compiled in - as far as I know it contains no suspicious data; start at https://firmware.intel.com/projects/minnowboard-max if you want to read about the details). I am rather sure this processor/board has no support for the dreaded Intel AMT.
If you want an ARM board: The "Sabre Lite - i.MX6 Quad Core" board is the nearest to this ideal that I know of. Traditionally Freescale has been very open with the specifications (though after NXP bought them this changed for the worse). The GPU (Vivante™ GC2000) is not strictly in the category "No binary blobs, no reverse engineering", but I have read that among the GPUs found on ARM boards it is by far the easiest one to reverse engineer, so a decent reverse-engineered driver seems to exist.
The K1 chip has a blob-free boot (unlike the newer chips AFAIK) except for the USB-based flash programming / recovery mode, so for the initial installation you'd need to manually flash the eMMC somehow if you don't want to run any proprietary code.
And yes, CUDA isn't supported by the open drivers. Other than that, I've heard Nouveau works relatively well. Though I'm not still sure if the open source graphics community has figured out a way to get stacks like Xorg or Wayland/Weston running without hacking the source on architectures where rendering and scanout are done by different DRM devices.
I have a vague hope that one day the dual-core VPU on the Pi will be usable for time-critical real-time i/o in much the same way as the PRUs on the BBB are used.
NuttX would be interesting on a Raspberry Pi. Much lighter weight than Linux but still implements POSIX-like interfaces. The POSIX stuff always felt too clunky for a microcontroller, but the rPi might be just right.
While a port has been discussed, I don't think a port has been completed.
I wonder what sort of practical benefit there is to the "lighter weight" of NuttX? I imagine the main attraction is real-time behaviour, but I think that is sort of orthogonal to being lightweight. And even then it has RTLinux to compete with.