Coming from maths, I am confused by the use of the term "idempotent" here. Unless we are talking about bootstrapping a compiler, I do not see how it is applicable here. Am I missing something?
They are definitely redefining "idempotent" here, even from a software development standpoint.
Since we already have "deterministic build" and "reproducible build" for this, there's no need to overload "idempotent" (which already causes confusion).
"Idempotent" sounds confusing to me here: an idempotent HTTP request (not very mathy, I know) is one I could run many times over, with the effect being the same as if it had been done only once.
For a software build it sounds like running `make` multiple times. But those builds will be idempotent even without reproducibility/determinism as they happen on the same system.
Well, a second call to "make" with unmodified source on a local system has the same resulting output because make knows how to avoid rebuilding output files when the source hasn't changed since the last build.
The goal of reproducible/deterministic builds is typically to be able to produce the same output files without already having that output. (That's because the people interested in reproducible/deterministic builds are usually trying to prove something about the relationship between the source and the build output. Make doesn't really do that -- the only thing it can prove is that the mod times of output files are later than the mod times of the source files they depend on, according to the make file.)
So, yes, make is typically deterministic in a general sense (given a good makefile and certain assumptions that are pretty reasonable in the context of a developer doing development on a local machine). But that isn't what people are looking for from reproducible/deterministic builds.
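A rough Python sketch of the timestamp check described above, to show how little it actually proves. `needs_rebuild` is a hypothetical helper, not make's real implementation, and it ignores everything else make does (rule parsing, phony targets, and so on).

```python
import os

def needs_rebuild(target, sources):
    """Hypothetical make-style check: rebuild if the target is missing
    or any source has a newer modification time than the target."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```

Nothing here looks at file contents, so a `False` result says only that the timestamps are in order, not that the output corresponds to the source.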
A lot of things in computer science that are conceptually functions actually aren’t. In this context, when one compiles e.g. a C program there is a bunch of state, some of it pretty counter-intuitive, that can cause the resulting object files or executable artifacts to change even though the source code did not. This has serious implications for relying on computer systems.
When a more systems-inclined person in computer science uses the word idempotent they very often mean something like “repeatable” or “deterministic”.
AFAIK this usage first gained traction in the theory of distributed systems, in which it (roughly) means that one application of an operation might change the state of the system, but subsequent applications will not change it further. Set union is sort of the canonical example.
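The set union example can be sketched in a few lines of Python. `apply_union` is a made-up name for illustration; the point is only that applying the same update a second time leaves the state unchanged.

```python
def apply_union(state: frozenset, update: frozenset) -> frozenset:
    """Hypothetical state update: merge `update` into `state`."""
    return state | update

state = frozenset({"a"})
update = frozenset({"b", "c"})

once = apply_union(state, update)   # the first application changes the state
twice = apply_union(once, update)   # the second application changes nothing

assert once != state
assert once == twice
```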
Not speaking to your comment specifically, but adding context to the thread: idempotence would mean having the same result, regardless of whether you run something once, twice, or 10 times, over the same input set. Idempotence requires but goes beyond determinism, as it also accounts for any externalities and side effects.
For example, let’s consider a function that accepts a string as an argument and then writes that string to disk. We can consider the disk state as a side effect of the function.
The function itself is perfectly deterministic (output string is a predictable and consistent function of input string), but depending on the implementation of side effects it may not be idempotent. If, for example, this function simply appended the output to a file “output.txt”, this file would grow with every invocation, which is not idempotent. If instead we overwrote the output file so that it reflects only the single output of the most recent run, then the side effects would also be deterministic, and the function would be idempotent.
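Here is a minimal Python sketch of that distinction. The function names and the `size_after_runs` helper are made up for illustration; both writers return the same string every time, and only the disk side effect differs.

```python
import os
import tempfile

def write_append(path, text):
    """Deterministic return value, but the file grows on every call."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text + "\n")
    return text

def write_overwrite(path, text):
    """Same return value; the file always reflects only the latest call."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text + "\n")
    return text

def size_after_runs(writer, runs):
    """Call `writer` `runs` times on a fresh output.txt and return its size."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "output.txt")
        for _ in range(runs):
            writer(path, "hello")
        return os.path.getsize(path)

# Overwriting: three runs leave the same disk state as one run.
assert size_after_runs(write_overwrite, 3) == size_after_runs(write_overwrite, 1)
# Appending: the disk state keeps changing with every run.
assert size_after_runs(write_append, 3) == 3 * size_after_runs(write_append, 1)
```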
At a pedantic level you could redefine your scope of deterministic to include not just outputs but also external state and side effects, but the above distinction is generally how deterministic and idempotent are applied in practice in computing. I cannot speak to the math-centric view, if there is a different definition there.
This captures the mathematical definition too, which is just that an element x is idempotent if x applied to itself gives you back x. I.e., what you said: applying the function many times produces no further change to the system.
That may be true on a theoretical level, but if you’re talking to a data engineer in practice, that’s the definition of idempotence you’ll find they expect.
A practical consideration might be that a disk may experience a fault in one of the executions: it works fine a hundred times but fails on the 101st (e.g. hits a disk-full error or a bad block).
But that simply means it's harder to implement true idempotency when it comes to disk usage.
This is why the problem is usually simplified to ignore unhappy paths.
I fear y'all (or I) may be dancing a bit too deeply into hypothetical here...
The idempotent version of this function doesn't blindly write. Most of Ansible, for example, is Python code doing 'state machines' or whatever - checking if changes are needed and facilitating if so.
What y'all assume is one function is, actually, an entire script/library. Perhaps I'm fooled by party tricks?
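The check-before-write pattern mentioned above might look roughly like this in Python. This is a sketch only, not Ansible's actual code; `ensure_line` is a hypothetical helper that reports whether it had to change anything.

```python
import os
import tempfile

def ensure_line(path, line):
    """Hypothetical helper: make sure `line` is in the file.
    Returns True only if the file had to be changed."""
    content = ""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            content = f.read()
    if line in content.splitlines():
        return False  # already in the desired state, so do nothing
    # Add a separator only if the existing file lacks a trailing newline.
    prefix = "" if content == "" or content.endswith("\n") else "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(prefix + line + "\n")
    return True

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "example.conf")
    assert ensure_line(path, "x=1") is True    # first run makes the change
    assert ensure_line(path, "x=1") is False   # later runs are no-ops
```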
Beyond all of this, the disk full example is nebulous. On Linux (at least)... if you open an existing file for writing, the space it is using is reserved. Writing the same data back out should always be possible... just kind of silly.
It brings about the risk of corruption in transit; connectors faulty or what-have-we. Physical failure like this isn't something we can expect software to handle/resolve IMO. Wargames and gremlins combined!
To truly tie all this together I think we have to consider atomicity.
> Writing the same data back out should always be possible...
Depends on the implementation: maybe you open a temp file and then mv it into place after you are done writing (for increased atomicity)?
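A sketch of that temp-file-then-rename approach in Python, under the assumption that the temp file lives in the same directory as the target so the rename stays on one filesystem. `atomic_write` is a made-up name for illustration.

```python
import os
import tempfile

def atomic_write(path, text):
    """Write `text` to a temp file next to `path`, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # replaces any existing file at `path`
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that this needs room for a second copy of the data while the temp file exists, so rewriting identical content can still fail on a full disk.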
But as I already said, in practice we ignore these externalities because it makes the problem a lot harder for minor improvements — not because this isn't something software can "handle/resolve".
To give a concrete example: some build systems embed a "build number" in the output that increments for each non-clean build (yeah this is stupid but I have seen it).
This is deterministic (doesn't change randomly), but not idempotent.
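A toy Python version of such a build, to make the point concrete. Everything here is hypothetical (the `build` function, the counter file, and the "compile" step, which is just uppercasing); it does not model any particular build system.

```python
import os
import tempfile

def build(source, counter_path):
    """Toy build: 'compile' the source and stamp it with a build number
    that is read from, and written back to, a counter file."""
    number = 0
    if os.path.exists(counter_path):
        with open(counter_path, encoding="utf-8") as f:
            number = int(f.read().strip() or "0")
    number += 1
    with open(counter_path, "w", encoding="utf-8") as f:
        f.write(str(number))
    return f"{source.upper()} [build {number}]"

with tempfile.TemporaryDirectory() as d:
    counter = os.path.join(d, "build_number")
    first = build("src", counter)
    second = build("src", counter)
    # Nothing random is involved, yet running the build twice
    # does not give the same result as running it once.
    assert first != second
```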
We already have isomorphic web apps, some day we might see a bundler advertised as "Calabi-Yau with benefits of mirror symmetry and vanishing first Chern class".