If QOI is interesting because of speed, you might take a look at fpng, a recent, actively developed PNG reader/writer that achieves comparable speed and compression to QOI while staying PNG-compliant.
I find it interesting that QOI avoids any kind of Huffman style coding.
Huffman encoding lets you store frequently used values in fewer bits than rarely occurring values, but the cost of a naïve implementation is a branch on every encoded bit. You can mitigate this by making a state machine keyed by the accumulated prefix bits plus as many bits as you want to process in a whack, but these tables will blow out your L1 data cache and trash a lot of your L2 cache as well.¹
The "opcode" strategy in QOI is going to give you branches, but they appear nearly perfectly predictable for common image types², so that helps. It has a table of recent colors, but that is only of a few cache lines.
In all, it seems a better fit for the deep pipelines and wildly varying access speeds across cache and memory layers which we find today.
␄
¹ I don't think it ever made it into a paper, but in the mid-80s, when the best our VAX Ethernet adapters could do was ~3 Mbps, I was getting about 10 Mbps of decompressed 12-bit monochrome imagery out of a ~1.3 MIPS computer using this technique.
² I also wouldn't be surprised if this statement is false. It just seems that for continuous tone images one of RGBA, DIFF, or LUMA is going to win for any given region of a scan line.
(I write comments with footnotes in the same style as you, but use “—⁂—” as the separator, via Compose+h+r (name from the HTML tag horizontal rule). Good fun being able to use Compose+E+O+T, Compose+E+T+X and Compose+F+F in this comment; I added the full set to my .XCompose years ago.)
One thing to note is that QOI composes really nicely with fast general-purpose compressors like LZ4 and ZSTD. LZ4 gives a roughly 5% size reduction with negligible speed impact, and ZSTD gives a 20% size reduction with moderate speed impact (https://github.com/nigeltao/qoi2-bikeshed/issues/25).
I would think that you could use a hybrid approach where you have a table that is perhaps 9 or 10 bits wide and covers the shorter codes, which by definition are the more common ones. It should be small enough to fit in the cache. Then do something slower for the very long codes. This way you avoid difficult branches most of the time.
Exactly. Normally you use second-level tables for the long codes. In the first table, if the code doesn't reach a leaf, tbl[code] holds a pointer to the next table to use.
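A minimal sketch of that two-level layout in C (the names, the 9-bit root size, and the bit-reader helpers are illustrative assumptions, not taken from any particular decoder; table construction, which replicates short codes across all matching root slots, is not shown):

    #include <stddef.h>
    #include <stdint.h>

    #define ROOT_BITS 9

    typedef struct {
        const uint8_t *data;   /* assumed to have a few padding bytes at the end */
        size_t pos_bits;
    } BitReader;

    /* LSB-first peek of up to 16 bits without consuming them */
    static unsigned peek_bits(const BitReader *br, unsigned n)
    {
        size_t byte = br->pos_bits >> 3, shift = br->pos_bits & 7;
        uint32_t w = br->data[byte] | ((uint32_t)br->data[byte + 1] << 8)
                   | ((uint32_t)br->data[byte + 2] << 16);
        return (w >> shift) & ((1u << n) - 1);
    }

    static void consume_bits(BitReader *br, unsigned n) { br->pos_bits += n; }

    typedef struct {
        uint16_t symbol_or_base;  /* decoded symbol, or base index into sub[] */
        uint8_t  length;          /* total code length in bits; 0 = long code */
        uint8_t  sub_bits;        /* extra index bits when length == 0 */
    } Entry;

    typedef struct {
        Entry root[1 << ROOT_BITS];  /* 2 KB: comfortably inside L1 */
        Entry *sub;                  /* concatenated second-level tables */
    } HuffTable;

    static unsigned decode_symbol(const HuffTable *t, BitReader *br)
    {
        Entry e = t->root[peek_bits(br, ROOT_BITS)];
        if (e.length) {                         /* common case: short code */
            consume_bits(br, e.length);
            return e.symbol_or_base;
        }
        consume_bits(br, ROOT_BITS);            /* rare case: long code */
        Entry e2 = t->sub[e.symbol_or_base + peek_bits(br, e.sub_bits)];
        consume_bits(br, e2.length - ROOT_BITS);
        return e2.symbol_or_base;
    }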
Funnily enough, it's just a few days since I wrote some similar code to support writing PNGs from a small embedded device. In this case the full deflate algorithm seemed like overkill in memory and CPU requirements, and most of the images were probably going to be served over a LAN anyway.
https://twitter.com/toitpkg/status/1471986776357097475
Since we are rehashing this for the 3rd (4th?) time, I'll repeat my (and apparently many others') key critique: there is no thought at all given to enabling parallel decoding, be it thread-parallel or SIMD (or both). That makes it very much a past-millennium-style format that will age very poorly.
At the very least, break it into chunks and add an offset directory header. I'm sure one could do something much better, but it's a start.
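Something as simple as the following header shape would already let a decoder fan chunks out to threads (entirely illustrative; nothing like this is in the actual QOI spec):

    #include <stdint.h>

    /* Hypothetical chunked wrapper around a QOI-like payload: each chunk
       covers a fixed number of scanlines and the decoder state (previous
       pixel, color index) resets at each chunk start, so chunks can be
       decoded independently. */
    typedef struct {
        uint32_t width, height;
        uint32_t rows_per_chunk;
        uint32_t chunk_count;
        /* followed by chunk_count 64-bit byte offsets, then the chunks */
    } ChunkDirectoryHeader;

Resetting the per-chunk state costs a little compression at each boundary, which is the usual trade-off for restart markers.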
This project is interesting because of how well it does compared to other systems of much higher complexity and without optimizing the implementation to high heaven. We can all learn something from that.
Good question. The answer is all the poor souls that N years later find themselves stuck with data in a legacy format that they have to struggle to decode faster.
Of all the artifacts in our industry, few things live longer than formats. E.g. we are still unpacking tar files (Tape ARchive), transmitted over IPv4, decoded by machines running x86 processors (and others, sure). None of these formats could possibly have anticipated the evolution that followed, nor predicted the explosive popularity they would have. And all of them (the latter two notably) have overheads with real material costs. IPv6 fixed all the misaligned fields, but IPv4 is still dominant. Ironically, RISC-V didn't learn from x86 and added variable-length instructions, making decoding harder to scale than necessary.
I'm not sure what positive lessons you think we should learn from QOI. It's not hard to come up with simple formats. It's much harder coming up with a format that learns from past failures and avoids future pitfalls.
QOI is designed with a very specific purpose in mind, which is fast decoding for games. This kind of image is very unlikely to be large enough to benefit from multithreading, and if you have a lot of them you can simply decode them in parallel. It's not meant to be the "best" image format.
Unrelated to the rest of your comment, but RISC-V does not have variable-length instructions. It has compressed instructions, but they're designed in such a way as to be easily and efficiently integrated into the decoder for normal instructions, which are all 32 bits.
My day job for 6+ years has been implementing high-perf RISC-V cores, and my name is in many of the RISC-V specs.
Variable length ISAs are characterized by not being able to tell the beginning of an instruction without knowing the entrypoint. This applies to RISC-V with compressed instructions. Finding the boundaries is akin to a prefix scan and has a cost roughly linear in the scan length, but IMO the biggest loss is that you can’t begin predecode at I$ fill time.
I fought against making the _current_ way to do compressed instructions a mandated part of the Unix profile, but RISC-V was (at least at the time) dominated by microcontroller people and there was a lack of appreciation of the damage it incurred. A lot of people far more senior than me couldn't believe what happened.
Interesting to contrast with Arm, which upon defining AArch64 did _away_ with variable-length instructions and thus also page-crossing ones. Maybe they knew something.
> IMO the biggest loss is that you can’t begin predecode at I$ fill time.
That helps enough to overcome the increased code size?
I really wouldn't say they learned nothing from x86, though. You only have to look at 2 bits to know an instruction's length, and if you can get your users to put in the slightest effort, compilers can be told not to use the C (compressed) extension.
That's a strawman. There are infinitely many ways to achieve the same or better density without the drawback. Allowing instructions to span cache lines, or even pages, is a mistake that we'll pay for forever.
The simplest possible mitigation would have been to disallow an instruction from spanning a 64-byte boundary. It would have almost no impact on instruction density, but it would have saved a lot of headaches for implementations.
Strawman? I wasn't even trying to characterize anyone else's point, I was just trying to list some significant improvements over x86.
> The simplest possible mitigation would have been to disallow an instruction from spanning a 64-byte boundary.
Sure, that sounds good. But before this you hadn't even mentioned any problems with split instructions that need to be mitigated.
(You did mention decoding without a known entry point, but a rule like that doesn't guarantee you can find the start of an instruction. And if it would help to know that a block of 64 bytes probably starts with an aligned instruction, that seems like something you could work out with compiler writers even without a spec.)
I did forget to mention the requirement that you can't branch into the middle of an instruction. If you have both of these constraints then you can unambiguously determine the location of all instructions in any aligned 64-byte block, including at I$ fill time.
Implementing this would require instruction fetch to take an exception on line-crossing instructions (which must be illegal) and a change to the assembler to insert a 16-bit padding nop or expand a compressed instruction to maintain the alignment. There is nothing needed from the compiler (or linker AFAICT). JITs will have to be aware though.
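As a toy illustration of what the assembler-side change amounts to (hypothetical; not any real toolchain's behaviour):

    #include <stdint.h>

    /* Before emitting a 4-byte instruction that would straddle a 64-byte
       boundary, insert a 2-byte c.nop (or widen an earlier compressed
       instruction) so no instruction crosses the line. */
    static uint64_t emit_insn(uint64_t pc, unsigned insn_bytes /* 2 or 4 */)
    {
        if ((pc % 64) + insn_bytes > 64)
            pc += 2;                  /* padding c.nop goes here */
        /* ... write the instruction bytes at pc ... */
        return pc + insn_bytes;
    }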
You'd also need to guarantee that there are no constants or other non-instruction data in the same cache line as instructions. If that's a reasonable constraint then sure, that sounds like it would be helpful.
Those poor souls N years later will either have to decode only a few images, which is still fast enough, or decode a lot of images, which can be parallelized and run concurrently at the per-image level. In the very worst case, a single extremely large image, you're a bit out of luck, but that case would be rare, and decoding is still pretty fast anyway.
Creating formats and specs that are "future proof" is a noble goal. Criticizing QOI for not being parallelizable inside the decode function seems more like a demand for premature optimization to me...
> Criticizing QOI for not being parallelizable inside the decode function seems more like a demand for premature optimization to me
What? Faster encoding and decoding is one of the primary reasons for the format. Yet, QOI decoders are currently an order of magnitude slower than SSDs available today and even worse compared to DRAM! Now seems like the perfect time to look at possible optimizations to close that gap.
QOI is not an interchange file format like PNG or JPG, it's more akin to DDS or KTX (e.g. a specialized game asset pipeline file format which doesn't require a complex dependency for decoding).
A surprisingly good image format is to use a per-line (or per-block) AR encoder, then compress the result with gzip on a low setting. It parallelizes very nicely, beats PNG on encode and decode speed, and is trivial to implement.
I'm really not a fan of using the generally unsupported aspects of file formats. Sure, PNG supports alternate compression schemes, which, in combination with the fully generic data-block system, I think technically (and accidentally) makes it Turing complete. But it's not like any reader ever supports it fully; hell, most readers don't support showing the other images in a PNG file. It's also quite common that palettes aren't properly supported, in particular for uncommon combinations. Palettes themselves are also not truly sufficient for what they are supposed to do, as that would require generic scalar-to-scalar functions. It's like this with every damned old image format. People try to make the format to end all formats and end up accidentally inventing really shitty programming languages with terrible separation of concerns, which are terribly supported but work because people only ever use them for single-image data.
Another, stranger mistake is the use of generic compression as part of the format; it's much better to leave that up to the filesystem or stream. I don't really understand why this was ever a good idea, but it certainly hasn't been one for decades.
The more recent blob formats people have developed aren't much better: they still confuse the specification of a single blob with the specification of the container format, and try to be convenient and fully generic at once. If people actually wanted a fully generic data format and accepted that this requires a programming language, just let that be Python, instead of inventing some shitty new one...
No, not animated.
I mean that the .png format specifies how to specify an arbitrary number of data blocks, without specifying how they should be used, aside from what amounts to a recommendation of which one should be shown.
It's used by some video games for assets, for instance storing what amounts to a thumbnail as the shown image, while leaving meshes and textures in the other internal data blocks.
It's autoregression. Think of a row of pixels as a signal y[n]. Assume y[n] = a*y[n-1] + b*y[n-2], and estimate a and b. Then store y[0], y[1], a, b, and the residual, and encode those four values plus the residual, which will mostly be zeros. You can vary the length and number of coefficients very cheaply, but a fixed two coefficients beat PNG in my tests. There are fast standard algorithms to find a and b which, along with some simple tricks to correct for precision, ensure lossless encoding. It's the mathematician's version of the ad hoc thing PNG tries to do.
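A rough sketch of the prediction/residual part, assuming 8-bit samples and coefficients already estimated (e.g. from the 2x2 normal equations) and quantized to 1.15 fixed point; the coefficient estimation and the gzip stage are left out. The only hard requirement for losslessness is that encoder and decoder compute the identical integer prediction:

    #include <stdint.h>

    /* Round-to-nearest fixed-point AR(2) prediction, clamped to the sample range.
       Encoder and decoder must both use exactly this function. */
    static int predict(int32_t a_q15, int32_t b_q15, int y1, int y2)
    {
        int32_t p = (a_q15 * y1 + b_q15 * y2 + (1 << 14)) >> 15;
        if (p < 0)   p = 0;
        if (p > 255) p = 255;
        return p;
    }

    /* residual[i] = y[i] - prediction; stored alongside y[0], y[1], a, b */
    void ar2_encode_line(const uint8_t *y, int n,
                         int32_t a_q15, int32_t b_q15, int16_t *residual)
    {
        for (int i = 2; i < n; i++)
            residual[i] = (int16_t)(y[i] - predict(a_q15, b_q15, y[i-1], y[i-2]));
    }

    void ar2_decode_line(uint8_t *y, int n,
                         int32_t a_q15, int32_t b_q15, const int16_t *residual)
    {
        for (int i = 2; i < n; i++)
            y[i] = (uint8_t)(predict(a_q15, b_q15, y[i-1], y[i-2]) + residual[i]);
    }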
Isn't that very similar to what PNG does? I'm not sure what you mean by "AR encoder", but PNG uses a per-line filter and then adds DEFLATE on top of that.
A thread can scan just the opcodes to find cut-off points and distribute the actual decoding to other cores. Surely you can do that with some SIMD magic, as well as in the decoding threads, without needing to bake the properties of today's SIMD into the encoding.
No it can't. The encoder doesn't insert any "cut-off points". In fact, nearly every chunk encodes the current pixel value in terms of one of the previous pixel values, so without knowing those it is impossible for a second core to start up and initialize its decoder state enough to produce correct output.
A top-level thread scans just the opcodes to solve this, with no decoding and no writing, and thus progresses faster through the stream than the child chunked-decoding threads it progressively spawns.
Not as quick as a format with a chunk table, but faster than naive single core.
I did read your comment. You don’t have any explanation of how the top level scan can maintain the color index array for QOI_OP_INDEX chunks without doing any decoding.
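For reference, this is roughly what the decoder state looks like (paraphrased from the QOI spec and trimmed to two ops for brevity): every emitted pixel refreshes the 64-entry index, and QOI_OP_INDEX reads from it, so a scan-only pass that skips decoding can't know what an index reference resolves to.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint8_t r, g, b, a; } Px;

    #define QOI_HASH(p) (((p).r * 3 + (p).g * 5 + (p).b * 7 + (p).a * 11) % 64)

    /* Only QOI_OP_RGB and QOI_OP_INDEX handled here; the point is the
       shared state (index[], px) that every chunk depends on. */
    static void decode_fragment(const uint8_t *in, size_t len, Px *out)
    {
        Px index[64] = {0};
        Px px = { 0, 0, 0, 255 };
        size_t p = 0, o = 0;
        while (p < len) {
            uint8_t b = in[p++];
            if (b == 0xfe) {                     /* QOI_OP_RGB */
                px.r = in[p++]; px.g = in[p++]; px.b = in[p++];
            } else if ((b & 0xc0) == 0x00) {     /* QOI_OP_INDEX */
                px = index[b & 0x3f];            /* needs the decoded history */
            }
            index[QOI_HASH(px)] = px;            /* every pixel refreshes it */
            out[o++] = px;
        }
    }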
I bet you can split big images into smaller QOI encoded chunks and decode those in parallel.
QOI is simple enough to remix the file format as much as you want in your own encoder/decoder (that's actually the USP), it's not meant as a standardized image exchange format, just something that lives in your own asset pipeline.
It's because this link (https://github.com/phoboslab/qoi) has been approved by mods for reposting: it appears on the "pool" list [1] [2]. Which is a bit odd because as you point out, a different link [3] for the same project already received lots of attention.
I wanted a simple format that allows you to load/save images quickly, without dealing with the complexity of JPEG or PNG. Even BMP, TIFF and other "legacy" formats are way more complicated to handle when you start looking into it. So that's what QOI aims to replace.
There's a lot of research for a successor format ongoing. Block-based encoding, conversion to YUV, more op types etc. have all shown improved numbers. Better support for metadata, different bit depths, and allowing restarts (for multithreading) are also high on the list of things to implement.
But QOI will stay as it is. It's the lowest of all hanging fruits that's not rotten on the ground.
> I wanted a simple format that allows you to load/save images quickly, without dealing with the complexity of JPEG or PNG. Even BMP, TIFF and other "legacy" formats are way more complicated to handle when you start looking into it. So that's what QOI aims to replace.
It's simply better suited for some types of images than others (e.g. the resulting size is sometimes bigger than expected). The main advantage is the very simple encoder and decoder with a specification that fits on a single page (and which still yields surprisingly good results for many image types):
I agree that it is quite easy to grasp the format in terms of implementation.
It seems basically like writing an image VM that accepts byte code. I think that could really be a way to specify many file formats more concisely. If, for example, you chose the correct automata/transducer class, one could easily specify some hedge-grammar-based XML file format and get a binary representation. Starting from grammars as a spec, it is typically more difficult to derive an implementation.
However, from reading the concrete spec, I wonder why you cannot differentially change the alpha channel, which leads me to the question of what happens when images have varying alpha levels.
"Everything" means two 32-bit integer values (width and height) in the header, that's hardly much of a downside ;)
Usually it's a good idea anyway to read file headers byte by byte instead of mapping a struct over it to avoid alignment, padding and endianness issues.
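For QOI that's particularly painless, since the header is just 14 bytes with big-endian fields. A minimal sketch (field layout per the spec: magic "qoif", width, height, channels, colorspace):

    #include <stdint.h>

    /* Read a big-endian u32 one byte at a time: no struct packing,
       no unaligned access, works on any host endianness. */
    static uint32_t read_u32_be(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
             | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }

    typedef struct {
        uint32_t width, height;
        uint8_t  channels, colorspace;
    } QoiHeader;

    static int parse_qoi_header(const uint8_t *buf /* >= 14 bytes */, QoiHeader *h)
    {
        if (buf[0] != 'q' || buf[1] != 'o' || buf[2] != 'i' || buf[3] != 'f')
            return -1;
        h->width      = read_u32_be(buf + 4);
        h->height     = read_u32_be(buf + 8);
        h->channels   = buf[12];
        h->colorspace = buf[13];
        return 0;
    }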
I just implemented this format in my game engine and the performance is crazy: image loading is 3.2 times faster (compared to PNG) and generating a game screenshot is 40 times faster!
The generated screenshots are smaller (by about 5%). However, the resource images in QOI format that I load are on average a little bigger (about 5%, and sometimes up to 35%). I guess it is not the perfect solution for AAA games, which already use more than 30 GB nowadays.
"AAA games" (or rather any 3D games) typically use lossy image file formats which directly map to the hardware-compressed GPU texture formats (like DXTx/BCx in DDS or KTX containers), QOI is an interesting alternative where lossy formats don't work well though (e.g. pixel art).
Is there any open source audio compression format like that? Lossless and very fast. I haven't found any yet.
EDIT: I'm thinking about a format that would be suitable as a replacement for uncompressed WAV files in DAWs. Rendered tracks often have large sections of silence and uncompressed WAVs have always seemed wasteful to me.
I'd also like to know what's the best (or any) lossless audio compression process/tools.
My application is to send audio (podcast recordings) to a remote audio engineer friend who will do the post processing, then round trip it to me to complete the editing.
WAV is so big it makes a 1 hr podcast a difficult proposition.
MP3 is unsuitable because compression introduces so many artefacts that the quality suffers unacceptably.
FLAC is limited to 24-bit depth. I was thinking of an intermediate format suitable for use in DAWs and samplers that also supports floating point to avoid clipping.
24-bit integer and 32-bit float have the same dynamic range available, so you are not losing any fidelity.
However, frankly, if you're working professionally with audio like that, the best solution is simply to have sufficient disk space available to work with raw audio.
Use FLAC to compress the final product, when you are done.
They have the same precision but float has vastly larger dynamic range due to the 8-bit exponent. When normalized and quantized for output this does result in roughly the same effective dynamic range (depending on how much of the integer range was originally used).
The issue is audio is typically mixed close to maximum so any processing steps can easily lead to clipping. One solution is to use float or larger integers internally during each processing step and normalize/convert back to 24-bit integer to write to disk. Another (better imo) option would be to do all intermediate steps and disk saves in a floating point format and only normalize/quantize for output once.
I haven't worked with professional audio in over 25 years (before everything went fully digital) but I would be surprised if floating point formats were not an option for encoding and intermediate workflows. Many quantization steps seems like a bad idea.
> I would be surprised if floating point formats were not an option for encoding and intermediate workflows.
For bouncing tracks to disk, uncompressed 32-bit floating-point formats are available, but I am not aware of any fast, losslessly compressed 32-bit floating-point format.
All professional audio production software these days internally works with 32/64 bit floats. That's the native format, because it allows you to go above 0 dBFS (maximum level), as long as you go back below it at the end of the chain.
EDIT: Floating point is useful while you are working to avoid any accidental clipping. As an intermediate format, like a ProRes for video. FLAC is great as a final format.
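In other words, the clipping only has to happen once, at the final conversion down to integer samples. A tiny illustration (purely illustrative scaling, not any DAW's actual pipeline):

    #include <stdint.h>

    /* Float samples on the mix bus may exceed +/-1.0 without losing
       information; the hard clip only happens when quantizing to the
       fixed 24-bit output range at the very end of the chain. */
    static int32_t float_to_i24(float s)
    {
        if (s >  1.0f) s =  1.0f;   /* this is where information is lost */
        if (s < -1.0f) s = -1.0f;
        return (int32_t)(s * 8388607.0f);   /* 2^23 - 1 */
    }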
Projects with 100+ tracks are not uncommon. Sampler/rompler of a single virtual instrument can play 10+ sounds simultaneously. Playback of an orchestral score with virtual instruments can easily go over 250 simultaneous sounds, so just a real-time playback (without any additional processing) would already be a challenge.
No, actually only first few dozen KB of each sample are usually preloaded into RAM. The rest is streamed from a SSD. One library of an orchestral section can have 100+ GB of samples. You wouldn't fit all sections in 128GB of RAM.
ZStandard is a very good compressor, with an especially fast decompressor. Maybe someone should try using this instead of zlib in an audio format (FLAC, WavPack, ...)
You can only seek within a gzip file if you write it with some number of Z_FULL_FLUSH points which are resumable. The command line gzip program does not support this, but it's easy using zlib. For example you might do a Z_FULL_FLUSH roughly every 50 MB of compressed data. Then you can seek to any byte in the file, search forward or backward for the flush marker, and decompress forward from there as much as you want. If your data is sorted or is a time series, it's easy to implement binary search this way. And standard gunzip etc will still be able to read the whole file as normal.
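A rough sketch of the writing side with zlib (error handling stripped; this produces a raw zlib stream, so swap deflateInit for deflateInit2 with windowBits 15+16 if you want an actual gzip wrapper; the 50 MB slice here is counted over input bytes just to keep the example simple):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    #define SLICE (50u * 1024 * 1024)

    int compress_with_restarts(const unsigned char *in, size_t len, FILE *out)
    {
        z_stream zs;
        unsigned char buf[1 << 16];
        memset(&zs, 0, sizeof zs);
        if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK)
            return -1;
        for (size_t done = 0; done < len; ) {
            size_t n = len - done < SLICE ? len - done : SLICE;
            zs.next_in  = (unsigned char *)(in + done);
            zs.avail_in = (uInt)n;
            /* Z_FULL_FLUSH emits a byte-aligned restart point; Z_FINISH ends it */
            int flush = (done + n == len) ? Z_FINISH : Z_FULL_FLUSH;
            do {
                zs.next_out  = buf;
                zs.avail_out = sizeof buf;
                deflate(&zs, flush);
                fwrite(buf, 1, sizeof buf - zs.avail_out, out);
            } while (zs.avail_out == 0);
            done += n;
        }
        deflateEnd(&zs);
        return 0;
    }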
(but seriously, MODs can encode hours of audio into kilobytes, the downside is of course that they require a special authoring process which seems to be a bit of a lost art today)
It is nice, but a pity it does not have a "turn right" opcode: start going left, and on the turn opcode, continue decoding pixels after turning 90 degrees to the right, until you hit a previously decoded pixel or the wall of the bounding box defined after the first two turns, in which case you turn automatically. The file ends when there's nowhere to turn.
This would eliminate the need for a header (bloat!) as the end of file is clearly defined, the size is defined after decoding the top and right line (second turn), and it's not so sensitive to orientation (a pathological image can compress very differently in portrait vs landscape in line oriented formats). Color profile can be specified in the spec.
Also allows skipping altogether some image-wide bands or columns that are of the background color (defined by the first pixel) as you do not need to walk over all the pixels.
An encoder just walking a regular spiral (no uniform bands detection) is not hard. The band thing is an accidental artefact of the idea but plain run length encoding probably already captures most of the effect so no imperative to actually implement it.
Speed, yes, it is a fair objection, until hardware adopts spiral encoding :-)
Seems like they benchmarked it against libpng, which shows anywhere from 3-5x faster decompression and 30-50x faster compression. That's pretty impressive, and even though libpng isn't the most performant of the PNG libraries, it's by far the most common.
I think the rust png library is ~4x faster than libpng which could erase the decompression advantage but that 50x faster compression speed is extremely impressive.
Can anybody tell if there are any significant feature differences that might explain the gap (color space, pixel formats, etc.)?
I think fundamentally it’s faster just because it’s dead simple. It’s just a mash of RLE, dictionary encoding, and delta encoding, and it does it all in a single pass. PNG has to break things into chunks, apply a filter, deflate, etc.
Filters are a form of delta encoding, and are optional for PNG encoders. Deflate is a form of dictionary encoding with RLE. There's no "breaking into chunks" in PNG; it can encode the entire image as a single IDAT chunk (and chunks themselves are so trivial they have no impact on speed).
You can choose not to do filtering when encoding PNG. Fast deflate settings are literally RLE-only, and you can see elsewhere in this thread people have developed specialized encoders that ignore most deflate features.
The only misfeature PNG has that slows down encoding is CRC. Decoders don't have to check the CRC, but encoders need to put one in to be spec-compliant.
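To make the "filters are delta encoding" point concrete, this is essentially the whole of PNG's Sub filter (the other filter types differ only in which neighbouring bytes they predict from):

    #include <stddef.h>
    #include <stdint.h>

    /* PNG Sub filter: each byte stores the difference from the
       corresponding byte of the pixel to its left, modulo 256.
       bpp = bytes per pixel; the first pixel has no left neighbour. */
    static void png_filter_sub(const uint8_t *raw, uint8_t *out,
                               size_t len, size_t bpp)
    {
        for (size_t x = 0; x < len; x++) {
            uint8_t left = (x >= bpp) ? raw[x - bpp] : 0;
            out[x] = (uint8_t)(raw[x] - left);
        }
    }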
I would love to have something to compress the raw files from my camera. They're huge, I have to keep a ton of them, and I also need to transmit them over internet for my backup.
I tried a few standard compression formats, with very little luck.
Canon has devised a very smart (slightly lossy) compression format for newer cameras, but there's no converter that I know of for my old camera files.
So, unless I shell out large amounts of money for a new camera, I'm stuck sending twice the data over the internet. Talk about pollution...
There is the option of converting to DNG files, which allow for really good lossless compression.
This does come at the cost of changing the file format, and risks losing metadata. That's why I personally decided to just buy more storage instead.
Come to think of it, have you tried running a modern compression algorithm on the data? I don't think I did. Could be cool if combined with ZFS or similar to get the compression done transparently.
At a previous job I was looking at different binary parsing methods. This project looks quite interesting, having binary format descriptions in YAML that can then be generated into your language of choice.
Not sure what you're expecting given how new it is. Why not write a polyfill as an exercise for yourself? Convert it to png, then save as an image tag to a data url.
It's always gonna be chicken-and-egg for this, and browsers won't spend the time sandboxing and supporting a codec until it's already popular.
So this will probably see a JS / Webasm shim, and if that proves popular, Blink and Gecko will consider it.
The day might come soon when browsers just greenlight a webasm interface for codecs. "We'll put packets in through this function, and take frames out through this function, like ffmpeg. Other than that, you're running in a sandbox with X MB of RAM, Y seconds of CPU per frame, and no I/O. Anything you can accomplish within that, with user opt-in, is valid."
https://github.com/richgel999/fpng
Disclaimer: have not actively tried either.