Rust on the MOS 6502: Beyond Fibonacci

vaxman · on Sept 21, 2021

See also https://www.nordicsemi.com/News/2019/12/Rust-a-security-prog...

PS: Holy crap! For the first time in my 40+ year career I have clicked-thru from a semi-relevant article about Rust on a micro p̵r̵o̵c̵e̵s̵s̵o̵r̵ controller to a reference about the...[RCA] COSMAC VIP (in the form of this dude's effort to get CHIP-8 running on LLVM-MOS). Do you have any idea how many lawns I had to mow to buy one of those? It was a big disappointment (over my ELF and SuperELF) too! ROFL

[ https://youtu.be/fLVN05Jl6wA ]

zwirbl · on Sept 21, 2021

I guess when mentioning Rust on Nordic controllers one should also mention these excellent projects

https://github.com/embassy-rs/embassy https://github.com/embassy-rs/nrf-softdevice

Together with https://github.com/nrf-rs/nrf-hal these enable most everything one can do on these controllers from pure Rust (the softdevice is a blob with a C-SDK that's wrapped in Rust though)

sagacity · on Sept 21, 2021

That is so cool. I saw some posts about LLVM-MOS a while ago, but at that point I thought it would be just another in a fairly long list of attempts to try and get LLVM to output 6502 instructions.

I never expected it to come together this well! Especially considering that the author of the article mentions there were so many issues with LLVM-AVR, you'd expect them to exist in LLVM-MOS as well. Apparently not! I guess the code quality will only improve from here on out, the loop at the bottom of the article does seem like it is not as optimal as it could be :)

mysterymath · on Sept 21, 2021

Up until just a few weeks ago, 100% of the codegen work we've put into LLVM-MOS has been to get it feature-complete and rock-solid. It's awesome to see that that work has paid off!

We're just now starting to really optimize the compiler; there's definitely a long road ahead of us, but our preliminary investigations suggest that we'll be able to get the thing to emit really quite good 6502 assembly.

Right now, it emits near-garbage in a large number of common cases, as seen in the article. This is mostly due to technical debt intentionally accrued while getting the thing working, though; we did stuff like use the default LLVM lowering for comparisons, which are ridiculously trash on the 6502. But there's only really a couple major technical hurdles left to overcome; everything else is just painstakingly teaching LLVM what the best 6502 assembly patterns are for various situations.

zozbot234 · on Sept 21, 2021

> We're just now starting to really optimize the compiler; there's definitely a long road ahead of us, but our preliminary investigations suggest that we'll be able to get the thing to emit really quite good 6502 assembly.

Is there any part of this optimization work that might be upstreamed to LLVM itself and benefit other architectures? Or is this stuff purely 6502-specific?

mysterymath · on Sept 21, 2021

Some of it might benefit AVR, which being 8-bit, shares some of the same problem space. But most of the changes we've made so far are of the kind where LLVM says "this doesn't happen", or "when it does, it's not important." And now the 6502 says, "uh, I actually do need that..."

So in absolute terms of maximizing the flexibility of LLVM, yes, the changes do seem to be broadly useful, but they're mostly in a direction that doesn't benefit most processors all that much.

For example, the 6502 really wants to replace stack usage with global usage; we do this absolutely whenever possible. Other targets actually run the opposite transformation; they replace global variables with stack ones! Placing things on the stack maximizes the chance it'll be in a fast CPU cache (or that it may be folded into a register; this does apply to us too.)

zozbot234 · on Sept 21, 2021

> For example, the 6502 really wants to replace stack usage with global usage; we do this absolutely whenever possible.

If I understand what you're getting at, this transformation comes up all the time when compiling either coroutines or user-space "green" threads. More obviously, it could expand the usefulness of LLVM for targeting very low-end microcontrollers (even "modern" ones targeting varieties of ARM or other recent architectures) where stack space, and memory more generally, is often at a premium.

cmrdporcupine · on Sept 21, 2021

I haven't looked at this closely, but 6502 really doesn't lend itself to C compilation. Three registers, only one of which works with the ALU, awkward immovable stack, etc.

The 65816 is a better target (moveable direct page and stack and some wider registers), but also awkward with its register mode switching.

gergoerdi · on Sept 21, 2021

From what I understand, LLVM-MOS treats large parts of the zero page as virtual ("imaginary") registers, so you have no shortage of that (https://llvm-mos.org/wiki/Imaginary_registers). Then, sufficiently advanced compiler technology improves the stack situation (https://llvm-mos.org/wiki/C_calling_convention).

dhosek · on Sept 21, 2021

6502 assembly has the distinct advantage of having special page-0 instructions for reading/writing from memory, including, if I recall correctly, the ability to take a 2-byte sequence and treat it as a 16-bit value (or was that in the AppleSoft ROM?)

retrac · on Sept 21, 2021

The main way to do pointer indirection (without self-modifying code) is to use the zeropage-specific indirect addressing modes, which use a 2-byte address stored in zero page as a pointer to a byte in memory. (And on the original 6502, the only available addressing modes for this forced you to use the X or Y register as an index, so you had to set it to 0 first!)
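A rough model of what the indirect indexed mode computes, as a sketch in Rust rather than anything taken from LLVM-MOS. The function names are made up for illustration, and memory is assumed to be a full 64 KiB slice.

```rust
/// Read a little-endian 16-bit pointer from two consecutive zero-page bytes.
/// The second byte's address wraps around inside the zero page.
/// Assumes `mem` holds the full 64 KiB address space.
fn zp_pointer(mem: &[u8], zp: u8) -> u16 {
    let lo = mem[zp as usize];
    let hi = mem[zp.wrapping_add(1) as usize];
    u16::from_le_bytes([lo, hi])
}

/// Model of `LDA (zp),Y`: fetch the pointer from zero page,
/// add the Y register to it, and load the byte at that address.
fn lda_indirect_y(mem: &[u8], zp: u8, y: u8) -> u8 {
    let addr = zp_pointer(mem, zp).wrapping_add(y as u16);
    mem[addr as usize]
}
```

With the bytes $00, $20 stored at $10/$11, `lda_indirect_y(&mem, 0x10, 5)` reads from $2005. Passing `y = 0` is the "set it to 0 first" case for a plain pointer dereference.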

sagacity · on Sept 21, 2021

You can treat 2 bytes (not just in the zero page, though) as indirect jump addresses, yes.

Doing something like "JMP ($2345)" will jump to whatever $2345/$2346 is pointing to.

dhosek · on Sept 21, 2021

It's a little amazing how much 6502 assembler sticks with me 35 years later.

But only a little. I didn't have the money to buy an assembler or the skill to write one so I would write out my programs in long-hand on graph paper and hand-assemble them before entering hex codes manually. While not the most efficient process, it did do a good job of encoding things into long-term memory.

sagacity · on Sept 21, 2021

Haha, yes, I can relate. I didn't do any 6502 coding for ~25 years and it mostly just stuck around. Apparently it's like riding a bike.

In the meantime I've forgotten most of the 68000 and z80 instruction sets.

dhosek · on Sept 21, 2021

I remember being in high school, reading K&R and trying to figure out how I could get a C compiler running on an Apple ][. Never did, but it was a useful intellectual enterprise.

My second (and last) assembly language after 6502 was 370 which replaces the "awkward immovable stack" of the 6502 with no hardware stack at all. Applications are completely responsible for maintaining their own call stack.

Someone · on Sept 21, 2021

Not only C, any language that thinks there’s other things than global state.

If all your functions are void foo(void) and you don’t use local variables (or your language doesn’t support recursion, in which case all locals can be given a fixed address), targeting 6502 is fine (it also helps if you avoid floating point, use 8-bit variables where possible, etc)

Not supporting recursion also means you can statically compute maximum stack depth. That way, you can avoid linking code that would overflow the stack.
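A minimal sketch of that static stack-depth computation, assuming a hand-written call graph. The `Func` type, the function names, and the byte counts are all hypothetical; a real toolchain would derive them from the compiled program.

```rust
use std::collections::HashMap;

/// Hypothetical description of one function: the bytes it pushes on the
/// hardware stack (including the 2-byte JSR return address) and its callees.
struct Func {
    stack_bytes: u32,
    callees: Vec<&'static str>,
}

/// Worst-case stack usage starting at `name`.
/// Returns None if recursion (or an unknown callee) is found,
/// because then no static bound can be computed this way.
fn max_stack(
    graph: &HashMap<&'static str, Func>,
    name: &'static str,
    active: &mut Vec<&'static str>,
) -> Option<u32> {
    if active.contains(&name) {
        return None; // `name` is already on the current call path
    }
    let func = graph.get(name)?;
    active.push(name);
    let mut deepest = 0;
    for &callee in &func.callees {
        deepest = deepest.max(max_stack(graph, callee, active)?);
    }
    active.pop();
    Some(func.stack_bytes + deepest)
}
```

A linker-style check could then compare the result for the entry point against the 256 bytes of the 6502's hardware stack page and refuse to link if it does not fit.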

sagacity · on Sept 21, 2021

The cool thing about LLVM-MOS specifically is that by using the zero page as virtual registers you sort-of get the same output with 'regular' code as opposed to this 'global variables' style of programming.

I recall a tutorial for 'cc65 optimizations'[0] which basically destroys a well-structured C program in order to do all of these optimizations (like making everything global) and it was absolutely terrible, code-wise. Well, the end result was probably fine, but it's just a shame these 'optimizations' were needed.

[0] I think it was this one: https://github.com/ilmenit/CC65-Advanced-Optimizations

Someone · on Sept 21, 2021

Nice article, but it doesn’t mention the really gnarly stuff such as using the fact that a subroutine happens to return with some flags set, or with some fixed value in the X register to shave off some initialization instruction in the code calling it.

A main advantage of the 6502 is that it only has 64 kilobytes of memory ;-). That means sufficiently advanced and motivated programmers can keep the entire program in their head, and also nudges them to avoid bloat such as the use of 16-bit integers.

cmrdporcupine · on Sept 21, 2021

Zero page is great, but has limitations, for sure. Lots of moving stuff back and forth into the accumulator in order to do anything with it. And not relocatable like in the 6809 or 65816 "direct page".

Some nice simple extensions to the 02 architecture would be:

1) relocatable direct page and stack like in the 816. 2) some way of aliasing A to a direct page address to avoid doing it by hand.

tom_ · on Sept 21, 2021

I think you could use zero page as the data stack. Treat X as your frame pointer, and the zero page "address" is then the offset into the frame. Most instructions have a zp,X addressing mode. (This scheme works well with (zp,X) too.) LDY zp,X and STY zp,X are available, and the useful read instructions have abs,Y forms. So you can do lookups into global tables with an 8-bit local variable index without having to save X.

You'd need a little region for making use of (zp),Y, probably callee-saved, putting previous values on the return stack with PHA.
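To make the zp,X idea concrete, here is a toy model in Rust. It is only a sketch: the `ZpStack` type and the downward-growing frame layout are assumptions made for illustration, and it ignores the region that would be reserved for (zp),Y pointers.

```rust
/// Effective address of a `zp,X` operand: the sum wraps within the zero page.
fn zp_x(offset: u8, x: u8) -> u8 {
    offset.wrapping_add(x)
}

/// Toy data stack living in the zero page, with X acting as the frame pointer.
struct ZpStack {
    zero_page: [u8; 256],
    x: u8, // frame pointer; frames grow downward in this sketch
}

impl ZpStack {
    fn new() -> Self {
        ZpStack { zero_page: [0; 256], x: 0 }
    }

    /// Reserve `size` bytes for a new frame on function entry.
    fn enter(&mut self, size: u8) {
        self.x = self.x.wrapping_sub(size);
    }

    /// Release the frame again on return.
    fn leave(&mut self, size: u8) {
        self.x = self.x.wrapping_add(size);
    }

    /// Model of `STA offset,X`: write a local at `offset` in the current frame.
    fn store(&mut self, offset: u8, value: u8) {
        self.zero_page[zp_x(offset, self.x) as usize] = value;
    }

    /// Model of `LDA offset,X`: read a local at `offset` in the current frame.
    fn load(&self, offset: u8) -> u8 {
        self.zero_page[zp_x(offset, self.x) as usize]
    }
}
```

The instruction operand stays a constant frame offset, so the same code works at any call depth; only X changes between activations.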

leeter · on Sept 21, 2021

I wonder if the CSG-65CE02 wasn't an attempt to make C easier for the C6x/c128 line. Unfortunately it never saw the light of day except as a serial controller and isn't available today

https://en.wikipedia.org/wiki/CSG_65CE02

sagacity · on Sept 21, 2021

They actually address some of that on their project page, see: https://llvm-mos.org/wiki/Findings

cmrdporcupine · on Sept 21, 2021

It's a good read, but I still maintain that the immovable zero page and stack make the '02 sub-optimal. The 816 lets you move both around, and the WDC C compiler at least does some nice things with this to allow a proper stack frame.

I suspect that an LLVM backend for the 816 would have to be something quite a bit different from the 02.

colejohnson66 · on Sept 22, 2021

The downside of the 65x816 compared to the 65x02 is the address/data line multiplexing. In order to add 8 more address lines without going above 40 pins,[0] they multiplexed them onto the data lines. So to decode the address, you need some support circuitry for latching and gating. The 65x816 datasheet (from WDC) gives a schematic for doing so, but it’s not as simple/clean as a 65x02.

I personally would choose the 65x816 over the other for a new design, but I can understand why it’s not as popular.

[0]: 40 was the de facto maximum. Although, the M68k had 64. That thing was a monster in size.

emrk · on Sept 21, 2021

Author of the mentioned post on the 6502.org forum here. In the meantime I worked a bit on implementing a proper Rust target triple for the 6502 (mos-unknown-none); the code is here: https://github.com/mrk-its/rust/tree/mos_target

Then the standard cargo tool can be used to build a 6502 executable directly. Some examples: https://github.com/mrk-its/a800-rust-test or https://github.com/mrk-its/llvm-mos-ferris-demo

gergoerdi · on Sept 22, 2021

That's cool! I wanted to avoid having to build Rust and/or LLVM from source myself, hence the somewhat awkward "tell Cargo we're on default target, let Clang sort it out at link time" setup.

codedokode · on Sept 21, 2021

I am not sure if it is a good idea to compile code targeted to modern processors to 8-bit CPUs like 6502. For example:

Languages like C (or Rust) allocate variables on the stack because it is cheap with modern CPUs, but 8-bit CPUs don't have addressing modes to access them easily. (by the way, some modern CPUs like ARM also cannot add a register to a variable on the stack).

The solution is not to use the stack for variables and instead use zero-page locations. As there are only 256 zero-page bytes, the same locations should be reused for variables in different functions. This cannot be used with recursive functions, but such code is inefficient anyway so it is better not to use them at all and use loops instead.
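One way to sketch that reuse is a simple overlay allocator: every function gets a fixed base offset, and functions that can never be active at the same time may share bytes. The code below is an illustration under assumptions, not how any particular compiler does it; `Routine`, the example names, and the sizes are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical description of a routine: bytes of locals and its callees.
struct Routine {
    locals: u16,
    callees: Vec<&'static str>,
}

/// Give `name` a fixed base offset of at least `base`, then place its callees
/// just above its own locals. Returns false on recursion or an unknown callee,
/// where fixed addresses would not work.
fn assign_offsets(
    graph: &HashMap<&'static str, Routine>,
    name: &'static str,
    base: u16,
    active: &mut Vec<&'static str>,
    offsets: &mut HashMap<&'static str, u16>,
) -> bool {
    if active.contains(&name) {
        return false; // a second activation would clobber the first
    }
    let routine = match graph.get(name) {
        Some(r) => r,
        None => return false,
    };
    match offsets.get(name) {
        // Already placed at least this high: nothing more to do.
        Some(&old) if old >= base => return true,
        _ => {
            offsets.insert(name, base);
        }
    }
    active.push(name);
    for &callee in &routine.callees {
        if !assign_offsets(graph, callee, base + routine.locals, active, offsets) {
            return false;
        }
    }
    active.pop();
    true
}
```

Two routines called from the same parent end up with the same base offset, since only one of them can be live at a time; a routine reachable along several paths gets the highest base any path requires.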

Another thing is heap and closures (that allocate variables on the heap). Instead of heap the code for 8-bit CPUs should use static allocation.

The article contains an example of 6502 code compiled from Rust and this code is inefficient. It uses too many locations for variables (rc6-rc39) and it wastes time saving and restoring those locations in prologue/epilogue.

No wonder that programs run slowly. It would be much better to compile CHIP-8 directly to 6502 assembly.

mysterymath · on Sept 21, 2021

Most of the inoptimality in the article isn't due to the issues you've raised, but rather due to us just starting to optimize LLVM-MOS.

First, I have utterly no idea why there are so many calls to memset; it looks like it's unrolling a loop or something... poorly. It also doesn't seem to be reusing registers when setting up the calls; that's also bad and should be fixed.

Second, if you take a look at the actual structure of the prologue and epilogue, you might notice that it's copying zero page to an absolute memory region called __clear_screen_sstk. This is because LLVM-MOS ran a whole-program analysis on the program and proved that at most one activation of that function could occur at any given time. Thus, its "stack frame" was automatically allocated statically as a global array, not relative to a moving stack pointer.

The reason that the prologue and epilogue spend so much time copying in and out of the zero page is just that we haven't taught LLVM-MOS how to access the stack directly, but there's no technical obstacle to doing so. Once that's done, the whole body of the function would operate on __clear_screen_sstk directly, and the prologue and epilogue would disappear completely.

Of course, from the first point, you shouldn't need any stack locations to do the body of this routine; there's a big ball of yarn here, but pulling on any of a number of threads would unravel it.

antirez · on Sept 21, 2021

Strange exercise because Rust and the original 6502 programming mood are totally different: a world of cleverness and the most obscure side effects in order to squeeze the last clock cycle. But everything is "hack value", I will respect.

person22 · on Sept 21, 2021

I don't think you can get past that the 6502 was meant to be programmed in assembly. Some of the tricks needed to optimally use memory just don't lend themselves to higher level languages. I started with a lot of BASIC and then moved to assembler because it was the easiest path.

rob74 · on Sept 21, 2021

Er... the article doesn't make it clear, but I guess we're talking about cross-compilation here? So it's not "Rust" (or, as he writes later, LLVM) running on the 6502, just the code generated by the Rust compiler.

Still cool though!

bluejekyll · on Sept 21, 2021

Don’t most people generally mean the target binary from the compiler and not the compiler itself when someone says “see * running on this architecture”?

I can see for some dynamic languages there being a distinction between the two, but for compiled binaries, generally Rust on X, it doesn’t seem important if rustc also runs on X (especially when discussing micro-controllers since one would rarely run a full compiler on the chip itself).

fmakunbound · on Sept 21, 2021

> Don’t most people

And the rest are Forth users happily running interactive, extensible compilers with built in assemblers, block IO, screen editors in a multiuser, multitasking environment.

kjs3 · on Sept 21, 2021

All 10 of them...sure.

rob74 · on Sept 21, 2021

Well, when someone says "see Doom running on this architecture", they usually do mean that Doom is running on the architecture. So "Rust for the MOS 6502" or something like that would have been better. But yeah, maybe I'm too nitpicky and unfair to a non-native speaker...

ww520 · on Sept 21, 2021

So WASM on 6502 next?

fallat · on Sept 21, 2021

It looks like so much Rust code to generate the simplest of 6502 code. No thanks.

gergoerdi · on Sept 21, 2021

Did you look at chirp8-engine, or only chirp8-c64? The value add is not in the parts that interface with the C64 internals; probably using C for that would make for nicer code. But I wanted to push as much into Rust as I could in the short amount of time I spent on this.

The real advantage of using Rust is in the actual program logic. E.g. the instructions are decoded into an algebraic datatype (in https://github.com/gergoerdi/chirp8-engine/blob/7623353a8bf0...) and then that is consumed in the virtual CPU (https://github.com/gergoerdi/chirp8-engine/blob/7623353a8bf0...). Rust's case-of-case optimization takes care of avoiding the intermediate data representation at runtime.
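For readers who have not seen the pattern, a small sketch of decode-into-an-enum followed by a consuming `match`. This is not the chirp8-engine code: the `Instr`, `Cpu`, `decode`, and `step` names are invented here, and only a handful of CHIP-8 opcodes are covered.

```rust
/// A few CHIP-8 instructions as an algebraic data type (illustrative subset).
#[derive(Debug, PartialEq)]
enum Instr {
    ClearScreen,                    // 00E0
    Jump { addr: u16 },             // 1NNN
    LoadImm { reg: u8, value: u8 }, // 6XNN
    AddImm { reg: u8, value: u8 },  // 7XNN
    Unknown(u16),
}

/// Decode a 16-bit opcode into the enum.
fn decode(op: u16) -> Instr {
    let x = ((op >> 8) & 0x0F) as u8;
    let nn = (op & 0x00FF) as u8;
    match op >> 12 {
        0x0 if op == 0x00E0 => Instr::ClearScreen,
        0x1 => Instr::Jump { addr: op & 0x0FFF },
        0x6 => Instr::LoadImm { reg: x, value: nn },
        0x7 => Instr::AddImm { reg: x, value: nn },
        _ => Instr::Unknown(op),
    }
}

/// Minimal virtual CPU state for the sketch.
struct Cpu {
    regs: [u8; 16],
    pc: u16,
}

/// Execute one already-fetched opcode by consuming the decoded value.
fn step(cpu: &mut Cpu, op: u16) {
    cpu.pc = cpu.pc.wrapping_add(2);
    match decode(op) {
        Instr::ClearScreen => {} // display handling omitted
        Instr::Jump { addr } => cpu.pc = addr,
        Instr::LoadImm { reg, value } => cpu.regs[reg as usize] = value,
        Instr::AddImm { reg, value } => {
            // 7XNN wraps on overflow and does not touch the carry flag
            let r = &mut cpu.regs[reg as usize];
            *r = r.wrapping_add(value);
        }
        Instr::Unknown(_) => {} // ignored in this sketch
    }
}
```

Whether the intermediate `Instr` value actually disappears at runtime depends on `decode` being inlined into `step` and on the optimizer, so the sketch shows the source-level structure, not a guarantee about the generated 6502 code.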

boomlinde · on Sept 21, 2021

No thanks indeed, but I completely agree with this sentiment from the article:

> It is worth pointing out that the amazing thing about chirp8-c64 is not how well it works, but that it works at all.