A thing that this article is hinting at that I think might be more fundamental to making good automation principles: idempotency.
Most of unix's standard set of tools (both the /bin programs and the standard C libraries) are written to make changes to state - but automation tools need to ensure that you reach a certain state. Take "rm" as a trivial example - when I say `rm foo.txt`, I want the file to be gone. What if the file is already gone? Then it throws an error! You have to either wrap it in a test, which means you introduce a race condition, or use "-f" which disables other, more important, safeguards. An idempotent version of rm - `i_rm foo.txt` or `no_file_called! foo.txt` - would include that race-condition-avoiding logic internally, so you don't have to reinvent it, and bail only if anything funny happened (permission errors, filesystem errors). It would not invoke a solver to try to get around edge cases (e.g., it won't decide to remount the filesystem writeable so that it can change an immutable fs...)
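A minimal sketch of what I mean by `no_file_called!`, in Python (the name is from my made-up example above): attempt the unlink and treat "already gone" as success, so there's no check-then-act race, while every other error still propagates.

```python
import os
import tempfile

def no_file_called(path):
    """Ensure `path` does not exist. No pre-check, so no TOCTOU race:
    we attempt the removal and treat "already gone" as success."""
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass  # already gone - that's exactly the state we asked for
    # anything else (EACCES, EROFS, EISDIR...) propagates: bail loudly

# Demo: calling it twice is fine; the second call is a no-op, not an error.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
open(path, "w").close()
no_file_called(path)
no_file_called(path)
```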
Puppet attempts to create idempotent actions to use as primitives, but unfortunately they're written in a weird dialect of ruby and tend to rely on a bunch of Puppet internals in poor separation-of-concern ways (disclaimer: I used to be a Puppet developer) and I think that Chef has analogous problems.
Ansible seems to be on the right track. It's still using Python scripts to wrap the non-idempotent unix primitives - but at least it's clean, reusable code.
Are package managers idempotent the way they're currently written? Yes, basically. But they have a solver, which means that when you say "install this" it might say "of course, to do that, I have to uninstall a bunch of stuff" - which is dangerous. So Kožar's proposal is a step in the right direction - since it seems like you wouldn't ever (?) have to uninstall things - but it's making some big changes to the unix filesystem to accomplish it, and then it's not clear to me how you know which versions of what libs to link to and stuff like that. There's probably smaller steps we could take today, when automating systems. Is there a "don't do anything I didn't explicitly tell you to!" flag for apt-get?
The article is hinting at referential transparency for packaging and configuration.
> it's not clear to me how you know which versions of what libs to link to and stuff like that
You'd typically link to the most recent version which you've tested against, and record its base32 hash in your package definition. That is, a package by default contains the exact identities of all of its dependencies - there is no "fuzzy matching" of packages based on a name and version range. The point here is that the packager of the application should know what he is doing, and by specifying exact dependencies, he is removing the "hidden knowledge" that often goes into building software. (In many cases this is just ./configure && make && make install, but it can be massively more difficult to reproduce a build, particularly if the dependencies aren't well specified.)
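To make "record its base32 hash" concrete, the identity is just a cryptographic hash of the dependency's exact contents. This is only an illustrative stdlib sketch - Nix actually hashes the whole build recipe (sources, flags, dependency identities), uses its own base32 alphabet, and `dependency_id` is a made-up name here:

```python
import base64
import hashlib

def dependency_id(path):
    # Identity = hash of the exact bytes. Any change to the dependency
    # yields a different id, so there is nothing "fuzzy" to match on -
    # no name, no version range, just the content itself.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).digest()
    return base64.b32encode(digest).decode().rstrip("=").lower()
```

Two packagers who specify the same id are provably talking about byte-identical artifacts; that's the "hidden knowledge" being made explicit.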
The Nix build system knows which version to build against because there is only one version to build against in the chrooted environment where the build occurs - which is the one whose identity you specified in the nixpkg.
> there is only one version to build against in the chrooted environment where the build occurs
This is all rather new to me. Would it be fair to make this analogy: the build process is not a portable/cross-platform event, so you basically distribute a BuildFoo.exe with statically-linked libraries included.
You're roughly guaranteed that the BuildFoo.exe will run (they've got those libraries), and the user gets Foo in the end (either dynamically-linked or statically).
Yes, it's a fair analogy. Nix doesn't require static linking, but it does require the exact dependencies to be present for shared libraries. You can run the Nix package manager on top of another system like Debian, but you'll need to build most of the core packages again with Nix, such as glibc, gcc etc. (they live alongside your system's packages in /nix/store, and can be linked in /usr/local). This basically works as long as the kernel you're running supports the features of the packages you install.
With NixOS you get the additional advantage of configuration management: everything, including the kernel, is handled by the package manager, which provides stronger guarantees that things will work as expected.
Everything is reproducible. Things that have no reason to be tangled up are, in fact, not tangled up. If that doesn't sound advantageous, I don't know what else can be said.
> If that doesn't sound advantageous, I don't know what else can be said
I mean specifically with regards to configuration management: that is, managing the part of software that developers intend to be modified so as to change the behavior of the program.
Maybe I just don't understand, but I don't see how this does anything to advance current config management dilemmas like how to merge a new upstream version of a configuration file with your site-specific changes; or how to deploy similar changes to large numbers of nodes at a time.
Modifying files in a git repo which are deployed to $ETC by ansible when a modification triggers a run, versus modifying files in a git repo which are used as "inputs" to a functional operating system, seems like a largely cosmetic difference to me.
Offtopic, but: what's an example of a situation where using rm -f is bad compared to rm in practice? That is, an example where rm would save you but rm -f would make your life upsetting?
On topic: idempotency may be a red herring in this context. Unfortunately filesystems are designed with the assumption that every modification is inherently stateful. (It may be possible to design a different type of filesystem without this assumption, but every filesystem currently operates as a sequence of commits that alter state.) So installing a library or a program is necessarily stateful. What do you do if the program fails to install? Trying again probably won't help: the failure is probably due to some other missing or corrupted state. So idempotency won't help you because there's no situation in which a retry loop would be helpful. That is, if something fails, then whatever operation you were trying to accomplish is probably doomed anyway (if it's automated).
I think docker is the right answer. It sidesteps the problem by letting you create containers with guaranteed state. If you perform a sequence of steps, and those steps succeeded once, then they'll always succeed (as long as errors like network connectivity issues are taken into account, but you'd have to do that anyway). EDIT: I disagree with myself. Let's say you write a program to set up a docker container and install a web service. If at some future time some component that the web service relies upon releases an update that changes its API in a way that breaks the web service, then your supercool docker autosetup script will no longer function. The only way around this is to install known versions of everything, but that's a horrible idea because it makes security updates impossible to install.
It's a tough problem in general. Everyone agrees that hiring people to set up and manually configure servers isn't a tenable solution. But we haven't really agreed what should replace an intelligent human when configuring a server.
well, the rm example is overly simple on purpose - the only thing that -f is actually going to do that's remotely dangerous is removing files that have the readonly bit set. I've never actually been bitten by that. In general though, I think this pattern scales poorly - the more complicated your task is, the more dangerous the "force it" mode becomes.
---
On the subject of what to do when something goes wrong:
Sometimes retrying installing a package does fix the problem: if there was a network error, for example, and you downloaded an incomplete set of files, the next time you run it it will be fine.
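In other words, it's worth separating plausibly-transient failures (where a retry helps) from persistent ones (where it just masks a broken state). A crude sketch - `retry` and the exception list are my own illustration, with `step` standing in for your download/install action:

```python
import time

# Errors that plausibly go away on their own (flaky network, timeouts).
TRANSIENT = (ConnectionError, TimeoutError)

def retry(step, attempts=3, delay=1.0):
    """Retry `step` only on transient errors; anything else indicates
    bad state, and retrying would just hide the real problem."""
    for i in range(attempts):
        try:
            return step()
        except TRANSIENT:
            if i == attempts - 1:
                raise  # give up and report the recurring failure
            time.sleep(delay * (2 ** i))  # simple exponential backoff
```

The key design choice is the allow-list: everything not explicitly marked transient fails immediately and loudly, which matches the "file a bug, fix it manually" stance below.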
If your package manager goes off the rails and gets your system into an inconsistent state, then you have a decision to make. Is this going to happen again? If not, just fix the stupid thing manually: there's no point in automating a one-time task. If it is probably recurring, then you need to write some code to fix it (and file a bug report with your distro!). I do not believe that there is a safe, sane way to pre-engineer your automation to fix problems you haven't seen yet!
In the meantime maybe your automation framework stupidly tries to run the install script every 20 minutes and reports recurring failure. The cost of that is low.
Docker is awesome, for sure, and I'll definitely use it on my next server-side project. It isn't a magic bullet, though - you still have to configure things, they still have dependencies. Just, hopefully, failures are more constrained.
---
and on the point of upgrading for security fixes: the sad reality is that even critical fixes for security holes must be tested on a staging environment. No upgrade is ever really, truly guaranteed to be safe. I guess if the bug is bad enough you just shut down Production entirely until you can figure out whether you have a fix that is compatible with everything.
> well, the rm example is overly simple on purpose - the only thing that -f is actually going to do that's remotely dangerous is removing files that have the readonly bit set.
Since you originally outlined the requirements as:
> Take "rm" as a trivial example - when I say `rm foo.txt`, I want the file to be gone.
then the file should be gone even if "the readonly bit" was set.
This is not only a contrived example, but a bad one, for system management. rm is an interactive command line tool, with a user interface that is meant to keep you from shooting yourself in the foot. rm is polite in that it checks that the file is writable before attempting to remove it and gives a warning. System management tools I would expect to call unlink(2) directly to remove the file, which doesn't have a user-interface, rather than run rm.
However, the system management tool doesn't start with no knowledge of the current state of the system, but rather one that is known (or otherwise discoverable/manageable). And then attempt to transform the system into a target state. They can not be expected to transform any random state into a target state. As such, the result of unlink(2) should be reported, and the operator should have the option of fixing up the corner cases where it is unable to perform as desired. If you've got 100 machines and 99 of them are able to be transformed into the target state by the system management tool and one of them is not, this isn't a deficiency of the system management tool, but most likely a system having diverged in some way. Only the operator can decide if the divergence is something that can/should be handled on a continuous basis, by changing what the tool does (forcing removal of a file that is otherwise unable to be removed, for example), or fixing that system, after investigation.
The other option is to only ever start with a blank slate for each machine and build it from scratch into a known state. If anything diverges, scrap it and start over. This is an acceptable method of attack to keep systems from diverging, but not always the pragmatic one.
it's probably safer to just remember to use mv instead, because there's a very high chance that you'll do the wrong thing on a terminal that doesn't have that alias available.
> Take "rm" as a trivial example - when I say `rm foo.txt`, I want the file to be gone. What if the file is already gone? Then it throws an error!
Simply rm the file and handle the particular error case of the file not existing by ignoring it. Other errors go through fine.
I've been doing this at work to try to wrangle a sense of control out of our various projects. I'm using Sprinkle, which is basically a wrapper around SSH.
What I'm finding is that most decent projects include idempotent ways to configure them. Apache, for instance, at least on Ubuntu allows you to write configs to a directory and then run a command to enable them. Sudo also has the sudo.d directory, cron has cron.d. Just write a file.
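The "just write a file" step can itself be made idempotent and atomic: compare contents first, then write to a temp file in the same directory and rename it over the target, so nothing ever reads a half-written config. A sketch (the function name and config contents are made up):

```python
import os
import tempfile

def write_config(path, content):
    """Idempotently install a config file. Returns True if anything
    changed, False if the file already had the desired content."""
    try:
        with open(path) as f:
            if f.read() == content:
                return False  # already in the target state: no-op
    except FileNotFoundError:
        pass
    # Write then rename: rename(2) is atomic within a filesystem, so a
    # reader sees either the old file or the new one, never a torn write.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.replace(tmp, path)
    return True
```

The changed/unchanged return value is what lets a wrapper like Sprinkle decide whether to run the "enable" or "reload" command afterwards.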
> Is there a "don't do anything I didn't explicitly tell you to!" flag for apt-get ?
I would consider this to be overly tight coupling. We should let dpkg manage the OS packages, and if the system's state needs to be changed, you can simply re-build it and run an updated version of your management scripts.
You don't really want to start getting into the game of trying to abstract over the entire domain of systems engineering. CM, in my opinion, should solve one and only one problem, moving system state between the infrastructure/cloud provider defaults and a state where application deployment scripts can take over. Every necessary change to get from point A to point B gets documented in a script. There are only two points on the map, and only one direction to go.
So CM is a provisioning tool? I thought of it as being more of "ensure trusted compute environment" tool. But all the existing tool sets require additional engineering to revert changes that aren't in their dynamically rendered file set.
>Simply rm the file and handle the particular error case of the file existing by ignoring it
How do I differentiate, so as to ignore one error and not others? Matching a string? What if this is supposed to be portable? Hard-code strings for every version of rm ever made?
>What I'm finding is that most decent projects include idempotent ways to configure them
Those are modifications debian makes. Lots of software supports including files, which lets debian do that easily. But sudo has nothing to do with you having a sudo.d directory, that is entirely your OS vendor. And having that doesn't solve the problem. What happens when I want to remove X and add Y? You need to have the config be a symlink, so you can do the modifications completely offline, then in a single atomic action make it all live.
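A sketch of that symlink swap (all the names here are made up): stage the entire new config tree offline, then flip a single symlink, which rename(2) does atomically.

```python
import os

def activate(new_config_dir, live_link="config"):
    # Build everything under new_config_dir first, offline. Then flip
    # `live_link` in one atomic rename: anything reading through the
    # link sees the old tree or the new tree, never a mix of both.
    tmp = live_link + ".tmp"
    if os.path.lexists(tmp):
        os.unlink(tmp)  # clean up a leftover from a crashed run
    os.symlink(new_config_dir, tmp)
    os.replace(tmp, live_link)  # atomically replaces the old symlink
```

Removing X and adding Y is then one operation - you prepare a tree without X and with Y, and the swap makes both changes live at the same instant.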
Your configuration management is going to have to be OS-dependent. Nothing is going to be so portable that you'll be able to use the same commands on different distros. POSIX is too leaky an abstraction to rely on.
I'm not sure if you are agreeing or disagreeing. Configuration management tools already exist that work across multiple operating systems. You can't rely on posix, but you also can't rely on anything else. There's no standard, sane way to get "what error happened" information from typical unix tools.
Idempotency is a nice goal. I tend to run into issues where, say, somebody changes a Chef attribute and re-runs chef-client (update) on the machine. Say that was a filepath that got changed. Without knowing about the previous filepath, the only thing that can be done is to work with the new path. It's technically idempotent in that if I run it twice without a config change it will not change anything on the second run, but unless on every attribute/recipe change I throw away the old machine and provision a new one, there is leftover state. That being said, I recreate instances fairly regularly as I believe there are always chaos monkeys lurking :)