Do you mean for administrative access to the machines (over SSH, etc) or for "normal" access to the hosted applications?
Admin access: an Ansible-managed set of UNIX users & associated SSH public keys, combined with remote logging (so every access is audited and a malicious operator wiping the machine can't cover their tracks), will generally get you pretty far. Beyond that, there are commercial solutions like Teleport which provide integration with an IdP, a management web UI, session logging & replay, etc.
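Roughly, the Ansible side of that can be as small as this (the `admin_users` variable and its fields are names I made up for illustration; the second task needs the ansible.posix collection):

```
# Sketch: keep admin accounts and their keys in one reviewed variable file.
# "admin_users" and its fields are illustrative names, not an Ansible convention.
- name: Ensure admin users exist
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: sudo
    append: true
    shell: /bin/bash
  loop: "{{ admin_users }}"

- name: Install each user's SSH public key (and remove any others for that user)
  ansible.posix.authorized_key:
    user: "{{ item.name }}"
    key: "{{ item.ssh_public_key }}"
    exclusive: true
  loop: "{{ admin_users }}"
```

Offboarding is then a matter of removing the entry (or flipping the user to `state: absent`) and re-running the play.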
Normal line-of-business access: this would be managed by whatever application you're running, not much different to the cloud. But if your application isn't auth-aware or is unsafe to expose to the wider internet, you can stick it behind various auth proxies such as Pomerium - it will effectively handle auth against an IdP and only pass through traffic to the underlying app once the user is authenticated. This is also useful for isolating potentially vulnerable apps.
> provisioning and running VMs
Provisioning: once a VM (or even a physical server) is up and running enough to be SSH'd into, you should have a configuration management tool (Ansible, etc) apply whatever configuration you want. This would generally involve provisioning users, disabling some stupid defaults (SSH password authentication, etc), installing required packages, etc.
To get a VM to an SSH'able state in the first place, you can configure your hypervisor to pass through "user data" which will be picked up by something like cloud-init (integrated by most distros) and interpreted at first boot - this allows you to do things like include an initial SSH key, create a user, etc.
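For illustration, a minimal cloud-init user-data file can look something like this (user name and key are placeholders):

```
#cloud-config
# Sketch of first-boot user data; the name and key below are placeholders.
users:
  - name: ops
    groups: [sudo]
    shell: /bin/bash
    sudo: "ALL=(ALL) NOPASSWD:ALL"
    ssh_authorized_keys:
      - ssh-ed25519 AAAA...your-key-here... ops@workstation
ssh_pwauth: false
package_update: true
```

How you feed it in depends on the hypervisor: libvirt/virt-install and Proxmox can both attach user data to a VM (typically via a NoCloud seed/config drive).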
To run VMs on self-managed hardware: libvirt or Proxmox in the Linux world, bhyve in the BSD world. Unfortunately most of these have rough edges, so commercial solutions are worth exploring there. Alternatively, consider whether you actually need VMs at all, or whether containers (which have much nicer tooling and a better performance profile) would fit your use-case.
> deployments
Depends on your application. But let's assume it can fit in a container - there's nothing wrong with a systemd service that just reads a container image reference from a file in /etc/... and uses `docker run` to run it. Your deployment task can just SSH into the server, update that reference in /etc/ and bounce the service. Evaluate Kamal, which is a slightly fancier version of the above. Need more? Explore cluster managers like HashiCorp Nomad or even Kubernetes.
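A sketch of that systemd + `docker run` approach, where the unit name, paths and port are all made up for illustration:

```
# /etc/systemd/system/myapp.service - sketch; "myapp", the paths and the port are illustrative
[Unit]
Description=myapp container
After=network-online.target docker.service
Requires=docker.service

[Service]
# /etc/myapp/image contains a single line like: IMAGE=registry.example.com/myapp:v42
EnvironmentFile=/etc/myapp/image
ExecStartPre=-/usr/bin/docker rm -f myapp
ExecStart=/usr/bin/docker run --rm --name myapp -p 127.0.0.1:8080:8080 ${IMAGE}
Restart=always

[Install]
WantedBy=multi-user.target
```

A deploy is then just `ssh host "echo IMAGE=registry.example.com/myapp:v43 | sudo tee /etc/myapp/image && sudo systemctl restart myapp"`.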
> Network side of things like VNet
WireGuard tunnels, set up by your config management tool between your machines, will appear as standard network interfaces with their own (typically non-publicly-routable) IP addresses, and anything sent over them is transparently encrypted.
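For reference, a wg-quick style config on one machine looks roughly like this (keys, addresses and hostnames are placeholders):

```
# /etc/wireguard/wg0.conf on host A - placeholders throughout
[Interface]
Address = 10.0.0.1/24
PrivateKey = <host A's private key>
ListenPort = 51820

[Peer]
# host B
PublicKey = <host B's public key>
AllowedIPs = 10.0.0.2/32
Endpoint = b.example.com:51820
PersistentKeepalive = 25
```

Bring it up with `wg-quick up wg0` (or the wg-quick@wg0 systemd unit) and bind your internal services to the 10.0.0.x addresses.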
> DNS
Generally there's very little reason not to outsource that to a cloud provider or even your (reputable!) domain registrar. DNS is mostly static data, though, which means that if you do need to run it in-house for whatever reason, it's just a matter of getting a CoreDNS/etc container running on multiple machines (maybe even distributed across the world). But really, hosted offerings are super cheap and there's no reason not to outsource this - go open an AWS account and configure Route53.
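If you did end up running it in-house, a minimal CoreDNS Corefile is about this much (zone name and file path are placeholders):

```
# Corefile sketch: serve one zone from a standard zone file
example.com {
    file /etc/coredns/db.example.com
    log
    errors
}
```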
> securely opening ports
To begin with, you shouldn't have anything listening that you don't want to be accessible. Then it's not a matter of "opening" or closing ports - the only ports that actually listen are the ones you want open by definition because it's your application listening for outside traffic. But you can configure iptables/nftables as a second layer of defense, in case you accidentally start something that unexpectedly exposes some control socket you're not aware of.
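A minimal default-deny nftables ruleset as that second layer might look like this (the port list is an example covering SSH, HTTP/S and WireGuard; adjust it to what you actually expose):

```
# /etc/nftables.conf - sketch; tune the dport sets to your actual services
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif "lo" accept
        tcp dport { 22, 80, 443 } accept
        udp dport 51820 accept
        icmp type echo-request accept
        icmpv6 type { echo-request, nd-neighbor-solicit, nd-neighbor-advert, nd-router-advert } accept
    }
}
```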
> Monitoring setup across the stack
collectd running on each machine (deployed by your configuration management tool) sending metrics to a central machine. That machine runs Grafana/etc. You can also explore "modern" stuff that the cool kids play with nowadays like VictoriaMetrics, etc, but metrics is mostly a solved problem so there's nothing wrong with using old tools if they work and fit your needs.
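The collectd side of that is just the network plugin pointed at the central box (hostname is a placeholder; 25826 is collectd's usual default port):

```
# /etc/collectd/collectd.conf.d/forward.conf - sketch (or put it straight into collectd.conf)
LoadPlugin network
<Plugin network>
  Server "metrics.internal.example" "25826"
</Plugin>
```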
For logs, configure rsyslogd to log to a central machine - on that one, you can have log rotation. Or look into an ELK stack. Or use a hosted service - again nothing prevents you from picking the best of cloud and bare-metal, it's not one or the other.
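The rsyslog forwarding bit is a one-liner on each machine (hostname is a placeholder; `@@` means TCP, a single `@` would be UDP). The central box needs rsyslog's imtcp input enabled on the matching port:

```
# /etc/rsyslog.d/50-forward.conf - sketch
*.* @@logs.internal.example:514
```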
> safely expose an application externally
There's a lot of snake oil and fear-mongering around this. First off, you need to differentiate between vulnerabilities of your application and vulnerabilities of the underlying infrastructure/host system/etc.
App vulnerabilities, in your code or dependencies: cloud won't save you. It runs your application just like it's been told. If your app has an SQL injection vuln or one of your dependencies has an RCE, you're screwed either way. To manage this you'd do the same as you do in cloud - code reviews, pentesting, monitoring & keeping dependencies up to date, etc.
Infrastructure-level vulnerabilities: cloud providers are responsible for keeping the host OS and their provided services (load balancers, etc) up to date and secure. You can do the same. Some distros provide unattended updates, which your config management tool can enable. Stuff that doesn't need to be reachable from the internet shouldn't be (bind internal stuff to your WireGuard interfaces). Put admin stuff behind some strong auth - TLS client certificates are the gold standard but have management overheads; otherwise, use an IdP-aware proxy (as mentioned above). Don't blindly trust app-level auth. Beyond that, it's the usual - common sense, monitoring for "spooky action at a distance", and luck. Not too different from your cloud provider, because they won't compensate you either if they do get hacked.
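As one concrete example of "bind internal stuff to your WireGuard interfaces": on distros whose default sshd_config includes a .d directory, a drop-in like this keeps SSH off the public interface entirely (the 10.0.0.1 address matches the earlier WireGuard sketch and is a placeholder):

```
# /etc/ssh/sshd_config.d/10-internal.conf - sketch
ListenAddress 10.0.0.1
PasswordAuthentication no
PermitRootLogin no
```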
> For more context, I worked at a very large hedge fund briefly which had a small DC worth of VERY beefy machines but absolutely no platform on top of it...
Nomad or Kubernetes.
No, using Ansible to distribute public keys does not get you very far. It's fine for a personal project or even a team of 5-6 with a handful of machines, but beyond that you really need a better way to onboard, offboard, and modify accounts. If you're doing anything but a toy project, you're better off starting with something like IPA for host access controls.
Why do you think that? I did something similar at a previous job, for an organization bordering on 1k employees.
User administration was done by modifying a YAML file in git. Nothing bad to say about it, really. It sure beats point-and-click Active Directory any day of the week, and the commit log comes in handy for audits.
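For illustration, the shape of such a file can be as simple as this (the field names here are hypothetical, and happen to line up with the Ansible sketch further up):

```
# users.yaml - hypothetical shape, changed via reviewed commits
admin_users:
  - name: alice
    groups: [sudo]
    ssh_public_key: "ssh-ed25519 AAAA... alice@laptop"
  - name: bob
    groups: []
    ssh_public_key: "ssh-ed25519 AAAA... bob@laptop"
```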
If there are no externalities demanding anything else, I'd happily do it again.
There is nothing _wrong_ with it, and so long as you can prove that your offboarding is consistent and quick, feel free to use it.
But a central system that uses the same identity/auth everywhere is much easier to keep consistent and fast. That's why auditors and security professionals will harp on IdP/SSO solutions as some of the first things to invest in.
I found that the commit log made auditing on- and offboarding easier, not harder. Of course it won't help you if your process is dysfunctional. You still have to trigger the process somehow, which can be a problem in itself when growing from a startup, but once you do that it's smooth.
However, git is a central system (a database, if you will) where you can keep identities globally consistent. That's the whole point. In my experience, the reason people leave it is that they grow the need to interoperate with third-party stuff which only supports AD or Okta or something. Should I grow past that phase myself, I'd feed my chosen IdM from that data instead.
What's the risk you're trying to protect against, that a "better" (which one?) way would mitigate that this one wouldn't?
> IPA
Do you mean https://en.wikipedia.org/wiki/FreeIPA ? That seems like a huge amalgamation of complexity in a non-memory-safe language, which I feel would introduce a much bigger security liability than the problem it's trying to solve.
I'd rather pony up the money and use Teleport at that point.
> which are technologies old and reliable as dirt.
Technologies, sure. Implementations? Not so much.
I can trust OpenSSH because it's deployed everywhere and I can be confident all the low-hanging fruit is gone by now - and if not, its ubiquity means I'm unlikely to be the most interesting target, so I'm more likely to escape a potential zero-day unscathed.
What's the market share of IPA in comparison? Has it seen any meaningful action in the last decade, and the same attention from both white-hats (audits, pentesting, etc) and black-hats (trying to break into every exposed service)? I very much doubt it, so the safe assumption is that it's nowhere near as bulletproof as OpenSSH and that a dedicated attacker is more likely to find a vuln there.