So 2000 fps, 16 million pixels, and 7000 operations per pixel, works out to 224 TFLOPS.
An RTX 4090 is advertised as being able to compute 82 TFLOPS.
Where do you think the extra is coming from? Is it just straightforward optimisations like constant folding? Or do you think the compiler is noticing that 1000 iterations of the loop don't change the answer and optimising it down to a single iteration?
There are 7866 instructions including the constants. There are 1406 constants leaving only 6460 real instructions to be executed (max, min, add, neg, sub, square, sqrt). Those constants can be directly encoded into most (possibly all) of the instructions when mapped to real machine instructions, which a compiler would likely do unless there was a good reason to keep it in a register or memory location.
Something I saw on a cursory scan was a number of near-duplicate instructions (I haven't written code to find all instances):
a = x - c
;; some time later
b = c - x
Recognizing that b is the negation of a, you can convert the calculation of b to:
b = -a
This may or may not be faster, but it does mean that we can possibly forget about c earlier (helpful for register allocation and can reduce memory accesses).
Negations can sometimes be removed depending on the lifetime of the variable and its uses. Taking the above, suppose we follow it with this use of b:
d = e + b ;; or b + e
We can rewrite this as:
d = e - a
Or if it had been:
d = e - b
It can be rewritten to:
d = e + a
And if there are no more uses of b we've eliminated both that variable and another instruction from the program. These and other patterns are things a compiler would detect and optimize.
Though looking at the uses of the results from neg, I think most are used in the sequence of max/min instructions following them so it may not be possible to eliminate them as I showed above.
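To make the patterns above concrete, here's a minimal sketch of a peephole pass over a hypothetical IR (each instruction a `(dest, op, src1, src2)` tuple; this is an illustration, not the challenge's actual .vm format):

```python
def peephole(instrs):
    """instrs: list of (dest, op, src1, src2); src2 is None for unary ops."""
    subs = {}   # (s1, s2) -> dest of an earlier "dest = s1 - s2"
    negs = {}   # dest -> a  for every known "dest = -a"
    out = []
    for dest, op, s1, s2 in instrs:
        if op == "sub" and (s2, s1) in subs:
            # dest = s1 - s2, but we already computed a = s2 - s1: dest = -a
            negs[dest] = subs[(s2, s1)]
            out.append((dest, "neg", subs[(s2, s1)], None))
        elif op == "add" and s2 in negs:
            # dest = s1 + (-a)  ->  dest = s1 - a
            out.append((dest, "sub", s1, negs[s2]))
        elif op == "sub" and s2 in negs:
            # dest = s1 - (-a)  ->  dest = s1 + a
            out.append((dest, "add", s1, negs[s2]))
        else:
            if op == "sub":
                subs[(s1, s2)] = dest
            if op == "neg":
                negs[dest] = s1
            out.append((dest, op, s1, s2))
    return out
```

Running it on the example above (`a = x - c; b = c - x; d = e + b`) rewrites `b` to a negation and `d` to `e - a`, after which `b` dies if it has no other uses.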
I compiled it for Ampere and counted 6834 actual F32 operations in the SASS after optimizations. I only counted FFMA, FADD, FMUL, FMNMX, and MUFU.RSQ after eyeballing the SASS code, so there might even be more. It's possible the FMNMX doesn't actually take a FLOP since you can do f32 max as an integer operation, and perhaps MUFU.RSQ doesn't either, but even if you only count FFMA, FADD, and FMUL there are still 3685 ops.
Yeah, it took me an hour-ish to get the performance and then an hour-ish to figure out how to actually render it correctly. I had never seen the P2/P5 PGM format before so I didn't realize P5 files were binary. :)
> (Spoilers: it also allocates 60+ GB of RAM for intermediate results, so consider reducing the image size before running it on your own machine)
The input algorithm is essentially written in SSA form, and it's easy to analyze when each "register" stops being used; then you can drop the arrays after their last use. Turns out the number of live registers never exceeds 160 at any point in time, and with this one addition the max memory usage drops to barely above 1GB.
I don't know why you are downvoted (at the time of this writing). That's a very good point and makes the simple Python solution feasible on many more computers (due to reduced memory footprint). It's also a straightforward pass (once you know what liveness analysis is). Since there are no loops it's just one pass starting from the end and working backwards to determine whether a variable is live or not, and then an extra check on each iteration to decide whether to delete particular arrays.
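The backward pass described above can be sketched in a few lines (assuming each instruction is a `(dest, op, args)` tuple in SSA order; the representation is made up for illustration):

```python
def last_uses(instrs):
    """instrs: list of (dest, op, args) in SSA order.
    Returns {line_index: [registers that die at that line]}, i.e. the
    point after which each register's array can be freed."""
    seen = set()
    frees = {}
    for i in range(len(instrs) - 1, -1, -1):
        _, _, args = instrs[i]
        for a in args:
            # a register reference seen for the first time, scanning
            # backwards, marks that register's last use
            if isinstance(a, str) and a not in seen:
                seen.add(a)
                frees.setdefault(i, []).append(a)
    return frees
```

The forward evaluation loop then drops the arrays in `frees.get(i, [])` after executing line `i`.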
Yeah, this took me a while. Basically, the entire file represents a sort of assembly language for a function. You need to evaluate that entire function for each pixel in the output image.
Each line starts with a line number (with a leading underscore). This is followed by an instruction, and zero, one, or two arguments. The arguments can be either floating point constants or integers (with leading underscores) which represent the results of those corresponding lines earlier in the function.
The x and y coordinates of each pixel are loaded in the assembly language using the var-x and var-y instructions.
Break down the range of x values from -1 to 1, and y values from -1 to 1, each into 1024 parts (or whatever resolution you want to plot at). Evaluate the function for each (x,y) coordinate. If the resulting number is negative colour the pixel white, otherwise colour it black.
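A minimal per-pixel evaluator along those lines might look like this (a sketch: the opcode list here is inferred from the thread and may not be exhaustive):

```python
import math

OPS = {
    "add": lambda a, b: a + b, "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b, "neg": lambda a: -a,
    "max": max, "min": min,
    "square": lambda a: a * a, "sqrt": math.sqrt,
}

def evaluate(text, x, y):
    """Run the whole instruction list for one (x, y); return the last result."""
    regs = {}
    for line in text.strip().splitlines():
        dest, op, *args = line.split()
        if op == "var-x":
            val = x
        elif op == "var-y":
            val = y
        elif op == "const":
            val = float(args[0])
        else:
            # leading underscore = reference to an earlier line's result
            val = OPS[op](*(regs[a] if a.startswith("_") else float(a)
                            for a in args))
        regs[dest] = val
    return val

def render(text, n=1024):
    """n*n grid over [-1, 1]^2; True = inside (value < 0), i.e. white."""
    step = lambda i: -1 + 2 * i / (n - 1)
    return [[evaluate(text, step(i), -step(j)) < 0   # flip y: row 0 is the top
             for i in range(n)] for j in range(n)]
```

This is of course the slow, obvious version; the fun is in making it fast.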
For a given (x, y), you color the pixel black if you're outside of a shape, and color the pixel white if you're inside the shape.
The math formula is a way to describe the contour of the shape. To tell if you're inside or outside of a shape, you plug in your (x, y) coordinate into the math formula, and if the output is > 0, then you're outside. If the output is < 0, then you're inside.
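As a toy example of the sign test (the shape and radius here are made up): a circle of radius 0.5 centred at the origin is described by f(x, y) = sqrt(x² + y²) - 0.5:

```python
import math

def f(x, y):
    # distance to the origin, minus the radius: negative inside the circle
    return math.hypot(x, y) - 0.5

inside = f(0.1, 0.1) < 0      # near the centre: inside
outside = f(0.9, 0.9) > 0     # near the corner: outside
```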
I think they call it a signed distance field. You should watch his talk that explains it. It's really great.
Can anyone explain where this blob of "assembly language" comes from? In practice, the author presumably started with the desired output image and worked backward into this esoteric format, but I'm not following how.
What is considered an acceptable preprocessing or transformation? In the limit case, a static executable that computes nothing and outputs a constant, literal image probably violates the intention of the question.
> Can anyone explain where this blob of "assembly language" comes from?
Assembly language is definitely the right analogy: it's a low-level target generated by higher-level tools. In this case, the expression came from a Python script calling this text(...) function:
The font is hand-built from geometric primitives (rectangles, circles, etc) and CSG operations (union, intersection, difference)
> What is considered an acceptable preprocessing or transformation?
I'm looking for interesting ideas, and to mine the depths of PLs / compiler / interpreter / runtime research. Just returning a fixed image isn't particularly interesting, but (for example) I just updated the site with a compile-to-CUDA example that shows off the brute force power of a modern GPU.
What are the practical implications of this kind of assembly language? Surely there’s more efficient means of describing 2D SDFs?
Fun exercise! I’ve been enjoying trying to find some new ways to approach the challenge. I managed to build a single string expression for the entire program, so it could be evaluated per-pixel in a shader, but it turns out the expression is too complex for WebGL & WebGPU and the shader fails to compile.
My next thought would be to evaluate the program at a low resolution to create a low res SDF texture for the shader to draw at a higher resolution. Some information will probably be lost, though.
> What are the practical implications of this kind of assembly language? Surely there’s more efficient means of describing 2D SDFs?
By analogy, you wouldn't program in LLVM IR, but it's a useful intermediate representation for a bunch of higher-level languages. Higher-level tools can target this representation, and then they all get to use a standard set of optimizations and algorithms (fast evaluation, rendering, etc).
Thanks so much for writing this up and sharing. It sounds like you've been through a horrific few years, but your perseverance in creativity is inspiring and fascinating. I wish you well in your efforts to make a better life for yourself, and I hope your work finds a wider audience.
Some of these solutions seem to copy the .vm file and rewrite it in another programming language. Is this permitted? It seems like cheating not to count the time spent compiling and transforming that code as part of the program's runtime.
Looks SDF-ish. My guess is convert the letters to curves, cut them into pieces without holes so each piece is the set of points between certain curves, write down SDFs for each curve (negative on the "inside", positive on the "outside"), combine all SDFs in a piece with intersection (max), then combine all pieces with union (min).
The text contains a math expression. The math expression can be interpreted as an abstract syntax tree. The AST can also be translated into bytecode and interpreted by a stack-based virtual machine.
I'm thinking that we can parse the input into an expression tree, and then for each x coordinate, have each node in the tree tell us which intervals of y coordinates it would return a value < 0 (or <= 0 without loss of generality). Since we're basically dealing with real numbers we can pretend true 0 doesn't exist.
Composing these intervals is trivial for all operations except addition/subtraction. (For example, `mul` is less than 0 if exactly one of its arguments is less than 0, `neg` is less than 0 if its argument is not less than 0, `min` less than 0 if either of its arguments is less than 0, etc.).
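Those sign-propagation rules can be sketched on sets of y samples (plain Python sets standing in for the interval machinery, just to illustrate the rules):

```python
ALL = frozenset(range(10))           # stand-in for "all y values"

def mul_neg(a_neg, b_neg):
    # product < 0 where exactly one factor is < 0
    return (a_neg - b_neg) | (b_neg - a_neg)

def neg_neg(a_neg):
    # -a < 0 exactly where a >= 0 (ignoring exact zeros, as above)
    return ALL - a_neg

def min_neg(a_neg, b_neg):
    # min(a, b) < 0 where either is < 0
    return a_neg | b_neg

def max_neg(a_neg, b_neg):
    # max(a, b) < 0 where both are < 0
    return a_neg & b_neg
```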
And then we define add(a, b) as sub(a, neg(b)). So then we only need to work out which y values sub(a, b) will return a negative result for.
sub(a,b) is negative when a<b.
We can have the sub() node ask its two child nodes which values they would return a negative result for. Every y coordinate where a gives a negative result and b gives a positive result is a y coordinate where sub(a,b) is negative. Every y coordinate where a is positive and b is negative gives a positive result. For the remaining y values I think we have to actually evaluate the children and find out which ones give a negative result.
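That three-way classification might look like this (again over sets of y samples, as an illustration):

```python
def sub_sign(all_ys, a_neg, a_pos, b_neg, b_pos):
    """Classify y values for sub(a, b) given each child's known signs."""
    def_neg = a_neg & b_pos          # a < 0 <= b  =>  a - b < 0
    def_pos = a_pos & b_neg          # a >= 0 > b  =>  a - b > 0
    unknown = all_ys - def_neg - def_pos   # must actually evaluate these
    return def_neg, def_pos, unknown
```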
Obviously memoise the evaluation so that we don't repeat any work, and turn the expression tree into a DAG by combining any identical nodes. Some subtrees won't contain var-x, so the memoisation of those nodes can persist across different x coordinates.
And then for each x coordinate we ask the expression tree to tell us which y coordinates give negative outputs, and plot those. It's possible that the idea would generalise to 2-dimensional intervals, not sure.
I haven't implemented this yet but I'm planning to try it out, but to be honest I don't expect it to be faster than Matt's GPU version based on recursive subdivision with interval arithmetic. Btw his talk is great https://www.youtube.com/watch?v=UxGxsGnbyJ4&t=80s
And, a second fun challenge would be to reverse engineer the input file to extract the font! I expect the file is basically a union of x/y translations of functions that plot individual letters, so it shouldn't be crazy hard to split out the top-level union, then split out the x/y translations, and then collect the letters. It's possible that it's more complicated though. In particular, is each character translated in x/y individually, or is each character translated only in x to form each line of text, and then the line as a whole is translated in y? Or something weirder?