Workflows/orchestration/reconciliation-loops are basically table stakes for any service that is solving significant problems for customers. You might think you don't need this, but once you start running async jobs in response to customer requests, you will eventually end up implementing one of these solutions.
IMO the next big improvement in this space is improving the authoring experience. In short, when it comes to workflows, we are basically still writing assembly code.
Writing workflows today is done in either a totally separate language (StepFunctions), function-level annotations (Temporal, DBOS, etc), or event/reconciliation loops that read state from the DB/queue. In all cases, devs must manually determine when state should be written back to the persistence layer. This adds a level of complexity most devs aren't used to and shouldn't have to reason about.
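To make the annotation style concrete, here is a minimal sketch using Temporal's Python SDK; the workflow and activity names are made up for illustration, and the point is that every activity boundary is a persistence decision the developer has to think about:

    # A minimal sketch of the annotation style (Temporal Python SDK).
    # The workflow and activity names are hypothetical, for illustration only.
    from datetime import timedelta
    from temporalio import activity, workflow

    @activity.defn
    async def charge_card(order_id: str) -> str:
        # Side-effecting call; its result is recorded so it is not re-run on replay.
        return f"charged:{order_id}"

    @workflow.defn
    class OrderWorkflow:
        @workflow.run
        async def run(self, order_id: str) -> str:
            # The workflow body must be deterministic; state is persisted at each
            # activity boundary, which is exactly the boundary the developer has
            # to reason about today.
            return await workflow.execute_activity(
                charge_card,
                order_id,
                start_to_close_timeout=timedelta(seconds=30),
            )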
Personally, I think the ideal here is writing code in any structure the language supports, and having the language runtime automatically persist program state at appropriate times. The runtime should understand when persistence is needed (i.e. which API calls are idempotent and for how long) and commit the intermediate state accordingly.
There seems to be a lot of negativity about this opinion, but I heartily agree with you.
Anytime you’re dealing with large systems that have a multitude of external integrations you’re going to need some kind of orchestration.
Any time you perform a write operation, you cannot safely and idempotently perform another I/O operation in the same process without risking a failure that makes the entire process unsafe to retry.
Most people, when faced with this problem, will reach for some kind of queuing abstraction: the message fails, and you retry it automatically later. If you're a masochist you'll let it land in a dead letter queue and deal with it manually later.
Sagas are one way to orchestrate this kind of system design. Routing slips are another, with the benefit that there is no central orchestrator and state is carried in the routing slip itself. Both are adequate, but in the end you'll need a lot of infrastructure and architecture to make them work.
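For readers who haven't met the pattern, the core of a saga is small; a framework-free sketch (the step/compensation pairs are hypothetical):

    # Run steps in order; on failure, run the compensations for the steps that
    # already succeeded, in reverse order. Steps here are hypothetical.
    def run_saga(steps):
        completed = []
        try:
            for action, compensate in steps:
                action()
                completed.append(compensate)
        except Exception:
            for compensate in reversed(completed):
                compensate()
            raise

    run_saga([
        (lambda: print("reserve inventory"), lambda: print("release inventory")),
        (lambda: print("charge card"), lambda: print("refund card")),
    ])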
Systems like Temporal take a lot of that pain away, allowing developers to focus on writing business code and not infrastructural and architectural code.
So I am fully in on this new pattern for the horrible integrations I'm forced to deal with. Web services that are badly written RPC claiming to be REST, or poorly designed SOAP services. REST services that make me do a GET request for the object I just created, because REST purists don't return objects on creation, only Location headers. Flaky web services that are routed over 2 VPNs because that's the way the infrastructure team decided to manage it. The worst case I ever had to deal with was having to process XML instructions over email. And not as an attachment, I mean XML as text in the body of the email. Some people are just cruel.
Give someone a greenfield and I’d agree, simplicity rules. But when you’re playing in someone else’s dirty sandpit, you’re always designing for the worst case failure.
And for the readers that are still wondering why this matters, I recommend this video from 7 years ago called “six little lines of fail”.
This seems like a weird benchmark, reading from /dev/urandom and gzipping random data does not seem like something most folks will want to do. It even appears like /dev/urandom speeds differ greatly on various architectures [0] and there are issues with /dev/random being fundamentally slow due to the entropy pool [1] (but I guess this is why the author uses /dev/urandom).
It would be better to measure something more related to what Docker users will actually do, like the build time of a common container, and/or the latency of HTTP requests to native/emulated containers running on the same machine.
One reason to feel positive about the virtualization issues is that Rosetta 2 provides x86->ARM translation for JITs, which an ARM-based QEMU could perhaps integrate into its own binary translation [2].
I'm glad somebody said something! Yes, the gzip perf test is pretty silly, but it illustrates a significant difference. /dev/urandom throughput on this setup was about 100 MB/s, so it wasn't a bottleneck for this test - the bottleneck was gzip.
Feel free to come up with a performance test yourself! I personally want to know what an HTTP test would look like. You can run an ARM image by running:
docker run -it arm64v8/ubuntu
Unfortunately, Rosetta 2 is not going to help here. Rosetta 2 translates x86 -> ARM, but only on Mac binaries. It does not translate Linux binaries, and cannot reach inside a Docker image.
Was your emulation done with the qemu user space emulator [1] (the syscall translation layer) or the qemu system emulator [2] (the VM)? If it was qemu-system, you might get better numbers with qemu-user-static, which does binary translation similar to Rosetta 2 rather than being a full system emulator with all its overhead.
You can probably use qemu-user-static to translate x86-64-only binaries in a Linux container on an ARM machine, too, but I have never tried.
That's strange; in my experience it shouldn't have a 6x slowdown. It might be due to several factors, but here's your test, running on my system without Docker:
Ryzen 3900X (host machine)
$ dd if=/dev/urandom bs=4k count=10k | gzip >/dev/null
10240+0 records in
10240+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 1.02284 s, 41.0 MB/s
qemu-aarch64-static
$ dd if=/dev/urandom bs=4k count=10k | proot -R /tmp/aarch64-alpine -q qemu-aarch64-static sh -c 'gzip >/dev/null'
10240+0 records in
10240+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 3.33964 s, 12.6 MB/s
> Emulators can run a different architecture between the host and the guest, but simulate the guest operating system at about 5x-10x slowdown.
I think this is a misleading statement because it implies that there is a constant performance overhead associated with CPU emulation. In reality, the performance relies heavily on the workload, more so with JIT-ed emulators.
Regarding this specific benchmark, I think there are two main factors contributing to the poor performance. The first factor is that the benchmark completes in a short period of time. With JITs, performance tends to improve for long running processes because JITs can cache translation results allowing you to amortize the translation overhead. Another factor is that your benchmark is especially heavy on I/O, meaning that it spends a lot of time translating syscalls instead of running native instructions.
I'd also like to add that CPU emulators sans syscall translation should work for any binaries, even those targeted for Linux. It would require a copy of the Linux kernel, but Docker won't work without it anyways.
So I'm not familiar with how Darwin does things, but on most FOSS unixes it's easy to use qemu to run one arch on another, either full system or just user mode emulation (which when wired up correctly lets you seamlessly execute ex. ARM binaries on an x86 system). I would expect it to be easy enough to either set up user mode translation, or just swap Docker's backing hypervisor with an x86 VM. Or, worst case, just run qemu-system-x86_64 on your ARM Mac, run Linux inside that VM, and run Docker on that Linux; SSH in and it should be mostly transparent.
One benchmark would be to track down a Python/JS/etc based "hello world" demo container. Base one version on Intel and the other on ARM, and measure each version's container build time and request latency after it is set up.
If changing the base image is all that's needed and both Dockerfiles otherwise assume ubuntu, this should not take too long.
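As a rough starting point, the latency half could be as simple as timing repeated GETs from the host; a sketch using only the standard library, assuming the container under test is already serving on localhost:8000 (adjust the URL/port to your setup):

    # A rough latency probe: time repeated GETs against a container assumed to
    # already be serving HTTP on localhost:8000.
    import time
    import urllib.request

    def median_latency(url="http://localhost:8000/", n=100):
        samples = []
        for _ in range(n):
            start = time.perf_counter()
            urllib.request.urlopen(url).read()
            samples.append(time.perf_counter() - start)
        samples.sort()
        return samples[len(samples) // 2]

    print(f"median latency: {median_latency() * 1000:.1f} ms")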
I use a 2016 MacBook i7/8GB as my daily development system. I love it. It's light and portable, which is important for me.
The main thing to understand about these machines is that the i7 CPUs are about as good as any other CPU on the market (aside from having fewer cores), but they're fanless, which means they rely on the case to dissipate heat, and so will thermally throttle during long-running high-CPU jobs. They're perfect for short-lived jobs, even multi-minute compilations, but will have a hard time getting through repeated long stretches of high CPU. In short: great for bursty computations with longer idle times where the laptop has a chance to cool off.
For instance, I build brew packages, full llvm builds, even small ML models, etc, without problems, because the machine starts cold and there is enough time after the job is done for the machine to cool off again. My machine suffers on tasks like Docker+Kubernetes/minikube that run a constantly polling VM in the background that takes 25-100% CPU when running idle.
For instance, someone in this thread mentioned the iOS emulator might be difficult to run. This may not be true, so long as the emulator does not constantly use lots of CPU - if it just uses high CPU in response to input events, it will likely be fine.
The 'partial interpretation' trick used here has also been used successfully in the database community to accelerate whole queries, etc. Tiark Rompf's group in particular has been pushing this idea to its limit.
> When a value is retrieved via its key, IronDB...
> Looks up that key in every store.
> Counts each unique, returned value.
> Determines the most commonly returned value as the 'correct' value.
> Returns this most common correct value
> Then IronDB self heals: if any store(s) returned a value different than the determined correct value, or no value at all, the correct value is rewritten to that store. In this way, consensus, reliability, and redundancy is maintained.
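In other words, the read path described above is a majority vote with read repair; a minimal Python sketch of the same idea (IronDB itself is JavaScript, and the dict-backed stores here are stand-ins for cookies/IndexedDB/localStorage/sessionStorage):

    # Majority vote + read repair, as described above. The dict "stores" are
    # invented stand-ins for the browser storage backends.
    from collections import Counter

    def get_with_repair(stores, key):
        values = [store.get(key) for store in stores]
        present = [v for v in values if v is not None]
        if not present:
            return None
        # Most commonly returned value wins.
        correct, _ = Counter(present).most_common(1)[0]
        # Self-heal: rewrite the winner to any store that disagreed or was empty.
        for store, value in zip(stores, values):
            if value != correct:
                store[key] = correct
        return correct

    stores = [{"k": "v"}, {"k": "v"}, {}]   # third store lost its copy
    assert get_with_repair(stores, "k") == "v"
    assert stores[2]["k"] == "v"            # repaired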
I'm not sure any of these properties are ensured the way anyone would want. Why do this? What are the failure modes this protects against? If only one of the backing stores is active when a result is written, and then the others later become active with no data, is blank data returned for the prior result? It seems like recording timestamps would fix this problem nicely on a single system and make this thing overall quite reliable.
Cookies are frequently cleared by users, and the other datastores -- IndexedDB, LocalStorage, and SessionStorage -- can be unceremoniously purged by the browser under storage pressure.
Thanks for the response! FWIW I'm overall really interested in this, as I maintain an application that uses localStorage to keep very important data while the app is offline until it can be uploaded.
I recently tested localForage [0] to see if it could be more reliable and get around the storage limitations of localStorage. Unfortunately, localForage has a very annoying problem where it picks a single storage backend to use, but the one it chooses can switch between page reloads (yes, even on the same device/browser), and this switching causes data loss as keys stored in the other backends become inaccessible. I'm very interested to see if IronDB can help here! Thanks for working on this.
> Why do this? What are the failure modes this protects against?
I have a use case for this: I've written what is basically a CRUD web app for a non-profit. It's an offline app (appcache) and right now uses localStorage to keep data offline in the browser while in the field. That is pretty brittle as it is, but dedicated application installations didn't fit the budget. The app is used in the field by volunteers all over the country. With volunteers fluctuating and many people being privacy conscious, it happens regularly that they change Firefox preferences to delete cookies etc. on shutdown, and newer Firefox versions are deprecating localStorage, potentially resulting in data loss.
This comment is excellent. The title of the original post should be: "When optimising code, never guess, always read the bytecode/assembly."
Without actually reading the assembly/bytecode/etc, you end up speculating about silly things like 'the two evaluations and assignments can happen in parallel, and so may happen on different cores'.
Possibly a subtitle: It's never what you think it is. I've worked in embedded and real time environments before and optimisation used to be my thing, but I am always surprised at how badly I guess what the problem is. It's a hard lesson to teach other people though because programmers are smart and smart people tend to default to thinking that they are correct :-)
But you are right: even when you isolate where the code is slow, you've still got a lot of work to do to find out why it is slow.
Indeed, I was unduly influenced by the code I was writing in the late 80s and early 90s that really did take languages with multiple assignments like this and ran them on different processors. You say it's a silly thing, but we used to do it - things have changed.
Added in edit: The magic term is "execution unit" not "core". As I say, things have changed, and the bundling of multiple execution units into each core, and multiple cores into each processor, is different in interesting and subtle ways from the situation I used to code, where I had a few hundred, or a few thousand, processors in each machine, but the individual processors were simpler.
I didn't mean to say this was a silly thing to do - most modern processors execute instructions out of order on multiple ALUs.
The problem is that the abstraction layer between the python code in question and the processor's instruction stream is so thick that it's hard to say one way or the other that the processor is indeed executing that particular pair of instructions in parallel. It's definitely executing many instructions out of order, but it's unclear (without inspection of the python interpreter and its assembly) what's happening at the machine level.
Looking at the bytecode of the Python program at least begins to tell us that the bytecode of the two versions is fundamentally different, which could account for the performance difference. Although what exactly makes the material difference is also under debate elsewhere in the thread. :)
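Seeing that difference is a one-liner with the standard library; the two functions below are stand-ins for the variants being discussed, not the article's exact code:

    import dis

    def separate():
        a = 1
        b = 2
        return a + b

    def chained():
        a, b = 1, 2
        return a + b

    # Compare the instruction streams; the exact opcodes vary by CPython version.
    dis.dis(separate)
    dis.dis(chained)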
I'd say without measuring first you have too much to look through for anything large enough to be interesting. So measure, drill in, eliminate some first obvious culprits, measure again, then look at assembly.
I'm mentioning obvious because sometimes it really is, like doing a linear search in a hotspot where it should be a binary tree and things like that.
Measuring is more important. Knowing "a is faster than b" without understanding why is more useful than guessing what should be faster based on assembly and getting it wrong because you don't perfectly understand how the CPU is actually executing your code.
Looking at assembly/byte code is interesting to understand what you could tweak, but again you need measurements to verify.
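And in Python specifically, a quick timeit run is usually enough to settle the "a is faster than b" question before anyone opens a disassembler; the snippets timed below are arbitrary stand-ins:

    # Measure rather than guess: time two equivalent snippets under identical conditions.
    import timeit

    setup = "xs = list(range(1000))"
    print(timeit.timeit("sum(xs)", setup=setup, number=10_000))
    print(timeit.timeit("total = 0\nfor x in xs: total += x", setup=setup, number=10_000))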
Minor nitpick about their exit code technique [0]: The command checks if the table exists, but it does not appear to re-run if the source file has been updated. Usually with Make you expect it to re-run the database load if the source file has changed.
It's better to use empty targets [1] to track when the file has last been loaded and re-run if the dependency has been changed.
Last I checked, MapBox's WebGL based vector tile renderer was a cut above the rest. For simple mapping, Google's stuff may cut it, but MapBox has the ability to draw complex vector polygons over a wide area that render/zoom/scroll fast and provide a nice experience. This is pretty important if you want to provide fine grained analysis of geographic areas.
If you wanted this on GMaps, you were stuck rendering all your vectors to image tiles, hugely increasing the size. MapBox's vector support made this very easy! It may be that Google/free options have caught up in this space, but I haven't re-evaluated in a few years.
I think MapBox could also be a winner in the GIS space as the GIS options have not made the most graceful move to the web.
Mapzen, a now sadly defunct mapping startup, also had an awesome (if I say so myself - I used to work there, but not on that team) vector tile renderer for browser and mobile. Check it out at https://github.com/tangrams
Any chance Mapzen would open source the code used to make the metro extracts? The formats your team had to offer were so much nicer than working with raw OSM data.
I'm glad you found them useful! The code which made all the metro extracts was embedded as a Chef recipe, although I'm sure you could just extract the bits which do stuff from the Chef wrappers: https://github.com/mapzen/chef-metroextractor
What about the fact that these instructions might get partially executed in the pipeline before the branch gets resolved and the pipeline flushed? If a mis-fetched instruction can reach the LSU stage before the pipeline gets flushed, it might serve as a speculative memory load...
They're not partially executed. The branch predictor only fetches instructions. They might be decoded, but it's not an out-of-order processor: pipeline stages only proceed if the previous phase is correct.
It's an in-order CPU, so that "issue" phase (pipeline step 5) stalls until the instruction pointer is resolved. Instructions must be issued to the "AGU Load" functional unit, which is what actually performs the read and pulls data into the cache hierarchy.
Note also that a single speculative memory load is insufficient for Spectre. You need two speculative memory loads.