There have been several comments regarding authentication, but what I see as the glaring issue is the lack of authorization. Without some sort of RBAC implemented within Teleport and/or integration with external authorization solutions, the use cases for Teleport are going to be limited to very simple scenarios. The suggestion in a couple of GitHub issues of using OS logins as roles is a non-starter in a large number of environments due to security and audit considerations.
I understand that trusted clusters were originally intended as a means to allow Teleport to work through restrictive firewalls, but the same types of environments that necessitate trusted clusters are also likely to require tighter restrictions on which users are allowed to access a cluster from a trusted cluster. The current solution of specifying individual allowed users is too limited and doesn't scale.
Gravity, which Teleport was split out of, is a system that helps SaaS providers deploy and manage on-site installations within enterprise networks. Maybe these issues are addressed at a different level within Gravity, but if not, I would be hesitant to allow a SaaS provider to deploy within my network using Gravity.
tssva, great points. Gravity is basically Kubernetes-on-Teleport, and our distribution of k8s has additional security measures built in; Teleport (in library form) acts more or less at the lowest "root" level, which the infrastructure owner can optionally turn on or off for pieces of their infrastructure. We'll be gradually publishing more documentation on Gravity soon, but you're right: RBAC lives one level higher.
It's interesting, but I don't think I'd take it over openssh just yet. I'm not at the scale where most of the features become relevant, and frankly, I don't trust it as much as I trust openssh, a piece of software that's been in active development for almost 17 years, has an excellent security track record, and a development team well known for its care for security and its response to security issues.
And in a piece of vital security infrastructure, trust is everything.
> "Teleport has completed a security audit from a nationally recognized technology security company. So we are comfortable with the use of Teleport from a security perspective."
It would be interesting to read more details about this audit. This blurb is pretty much "Trust us" with more letters.
Having worked on software which received audits from a "nationally recognized security company", this statement on their part gives me no additional confidence. I performed what I would consider a "light" audit on the same codebase and found database injection attacks, auth-forging attacks, and denial of service; the infrastructure it was deployed on also used versions of libraries which opened the service up to plausible remote code execution.
Ditto, it just basically says "trust us"; no offense to the authors.
Unfortunately, our current agreement with the security company we hired prevents us from sharing the details unless under NDA. We may revisit those terms so that we can publish it now that Teleport is garnering more interest.
Edit: tried to add more clarification to address jtchang's question.
My company pays for an annual external audit, and I do most of the negotiations.
About half the companies don't want the audit results going anywhere else. About half are OK with distribution under NDA. In between, there are companies which will negotiate the right to distribute, usually with an explicit indemnity clause, and on the far end are a minority of companies that tell you that the results are work for hire and entirely your prerogative.
The major concern is that the customer will misrepresent the results in some way, or even represent them accurately but then change the underlying service to open up a vulnerability. Either way, nobody really wants to be in the news...
It's exactly this. A private audit is generally cheap, and the truth is that such an audit often doesn't mean much: the sweep of the software can be so superficial that even glaring problems may not be found.
Publicly releasing an audit puts the auditor's reputation on the line. As such, the audit will always be performed very seriously, digging into the code and scrutinizing every possible angle - at a level usually not even remotely done for a private audit.
qwertyuiop924: I'm with you. When it comes to security it pays off to be skeptical and pragmatic. We're on the hook of keeping it robust, simple and secure for years to come. So eventually, with the help from the OSS community, we expect it to earn your trust, hopefully sooner than in 17 years though :)
> And in a piece of vital security infrastructure, trust is everything.
Trust is based on an individual's experience over time. It can also be based on statistical observations over time. Statistically, nearly every single piece of software in use today has had a security hole in it at some point in time and many still have them. Audit or not, openssh is included in this fact. Today's infrastructure software is far too complex to ever be secure past a certain statistical threshold, at least without reinventing the way the Internet works.
Your comments are curious given Teleport just came out. One comes to trust something by spending time with it first. You can't apply anti-trust to something just because it's not the same as something you do trust.
>Your comments are curious given Teleport just came out. One comes to trust something by spending time with it first. You can't apply anti-trust to something just because it's not the same as something you do trust.
I don't follow this line of reasoning. I was with you for the opening sentence - trust is a thing earned over time - and then you lost me.
Statistically it is a near-certainty that OpenSSH does have undiscovered security holes. The trust that the maintainers have built over years makes me comfortable with that, because I know they will be rapidly addressed in a reasoned out manner.
I don't see gp as applying anti-trust just because it is different from OpenSSH - I see it as the maintainers of Teleport not yet having earned the trust that the maintainers of OpenSSH have.
And statistically as a new product with a new code base, Teleport is likely to have more security issues than an established product actively monitored for security flaws in the same space.
So the issue is not a matter of "new and different is bad", so much as "new product and (relatively) unknown community behind it are unproven". Those two factors combined make it inherently less trustworthy in a space where trustworthiness is critical.
The parent post is presenting conflicted information about trust. I find that both interesting and disgusting. I'm disgusted (fear + boredom) with it because I fear for the human race moving forward with existing centralized infrastructure. The boredom is a side effect of it involving trust and me not trusting it. I remain interested in talking about trust, however, so I'm conflicted myself. :)
This is the basic argument they are making:
> I don't trust it as much as I trust openssh...And in a piece of vital security infrastructure, trust is everything.
This is an implicit statement that all infrastructure tends toward full trustworthiness through the increasing use of legacy software. It also implies that trusted infrastructure affects everyone (which is true). Together, this is a logical fallacy: infrastructure will never be 100% trustworthy, and when it breaks it will not affect everyone the same. You yourself made that observation regarding holes in OpenSSH. It is anti-trust, or "inverse trust" (if one is a stickler for terms), when one publicly implies that their individual/internal trust of a thing is based on their societal/external trust of another thing which is being used in a mutually exclusive way. That itself is an anti-competitive practice, whether intentional or not.
I think calling this flawed logic out is important when the subject is trust and security. There's a reason for the limit of causality in our universe. The lesson we can take from that limit would be to not apply inverse internal trust to something to prevent its external use, when using it is required to trust it.
I fail to see what's wrong with saying that a new piece of software hasn't yet earned the trust that an older piece of software has. However, I never said that it could not earn that trust. That's an assumption on your part.
And I don't distrust Teleport because OpenSSH exists, I distrust Teleport because it hasn't proven its reputation yet. If OpenSSH didn't exist, I'd be saying the same things. The difference is, because OpenSSH does exist, I can make a comparison to a piece of infrastructure I do trust.
And I do agree that we cannot move forward with existing infrastructure, I just don't trust this particular piece of new infrastructure yet.
This project was posted here 6 months ago, but has since evolved quite a bit. Version 1.1.0 was recently released, so I personally wanted to hear HN's thoughts on this more stable version of the project. (I am not affiliated.)
Ditto. I evaluated it back then and thought it could be fantastic at some point, but it was just missing too many small things. Admittedly these were things we could have solved ourselves, but it seemed like they would be added to the core product at some point, so I opted to hold off on adoption for a while.
Agreed, this looks like a pretty good evolution on plain (open)sshd now.
It still seems to sit at a somewhat uncomfortable place between being an augmented ssh that's easier to manage and a completely separate auth/authz solution (in some ways a little like installing saltstack or puppet etc., which essentially grants a daemon arbitrary/root privileges and the ability to make configuration changes "outside" the unix user/group framework).
Things like:
"Also, people can join your session via terminal assuming they have Teleport installed and running. They just have to run:
!!! tip "NOTE": For this to work, both of you must have proper user mappings allowing you access db under the same OS user." (my emphasis)
I don't really see this as a big problem -- but I'd prefer a tool that basically magically took care of generating short-lived user certificates for ssh (and it does indeed look like Teleport does this now, in a rather nice and well-documented[1] manner). I'd still like to keep the user database in some sane place, like LDAP, especially as long as Teleport use is subject to pam/unix users and groups in addition to its internal users/authorizations.
I also don't particularly like having to use password login (even with two-factor) - but I suppose one could either a) wrap the web bit in ssl client auth (but now you need two CAs, one for ssh certs and one for x509), or b), better for my use case, create a daemon that works as an ssh daemon and can be configured for normal ssh access (set of known keys, access limited to the internal network/whitelisted IPs) to be used for initial auth. Or maybe just replace the current web auth with a specialized ssh daemon that uses standard ssh+two-factor login (something like [2]).
All that said, I really like the look of this project in its current state, and I'll strongly consider using it over setting up a "bare" (open)ssh CA system.
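For reference, the "bare" (open)ssh CA flow being compared against looks roughly like this; the key names, the principal, and the 8-hour validity below are illustrative assumptions, not anything Teleport-specific:

```sh
# Create a CA keypair used only for signing user keys (illustrative names/paths)
ssh-keygen -t ed25519 -f user_ca -C "users CA"

# Sign a user's existing public key into a short-lived certificate:
# "alice" is the allowed principal (i.e. the OS login), valid for 8 hours
ssh-keygen -s user_ca -I alice@example.com -n alice -V +8h ~/.ssh/id_ed25519.pub
# -> writes ~/.ssh/id_ed25519-cert.pub, which the ssh client picks up automatically

# On each server, trust the CA instead of distributing individual keys
# (add to /etc/ssh/sshd_config and reload sshd):
#   TrustedUserCAKeys /etc/ssh/user_ca.pub
```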
[2] Note, I'm not a great fan of the google auth pam module, but I have successfully tested using OATH -- and it works well both with pre-generated passwords and with Google Authenticator or similar TOTP apps. An added benefit is that it can also be enabled for sudo/su elevation, rather than just relying on a password for that.
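For anyone curious, the pam_oath wiring looks roughly like this; the file paths, secret, and window are assumptions sketched from memory of oath-toolkit, so check the module docs before relying on it:

```sh
# Enroll a user: one TOTP entry per line in the oath-toolkit users file
# (hex-encoded secret, 30-second step, 6 digits)
echo 'HOTP/T30/6 alice - 3132333435363738393031323334353637383930' >> /etc/users.oath
chmod 600 /etc/users.oath

# Require a TOTP code on top of the usual checks for privilege elevation
echo 'auth required pam_oath.so usersfile=/etc/users.oath window=5' >> /etc/pam.d/sudo

# The same auth line works in /etc/pam.d/su, or in /etc/pam.d/sshd together
# with ChallengeResponseAuthentication in sshd_config.
```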
Incidentally my fight with OATH/pam/various python tools to generate qr-codes allowed me to overcome the seemingly insurmountable challenge of pairing the standard Google Authenticator App with the Star Wars: Old Republic TOTP secret in order to enable 2-factor login. It appears that makes me a member of a rather select club of swtor-players....
Why is ssh even needed on distributed clusters? Shouldn't provisioning the cluster be automated and the nodes be immutable by design? I can only imagine what a nightmare a huge fleet of special snowflake machines would be to manage. Cattle, not pets.
I firmly believe No-SSH is a goal you should always strive for but never actually achieve. There are always cases where you need to do really detailed troubleshooting that requires things like tcpdump, a debugger, or even running lsof (or other expensive command that you can't afford to run regularly and log).
Some random user with some edge case will put one of your nodes for your service into a bad state. Throwing away the machine will just lose the state and push the user somewhere else.
Also centralized log and metrics stores get real big and expensive really fast. Sometimes you simply can't afford to ship everything to it. So you'll find yourself putting detailed debug logs on the "free" ephemeral/local disks while info logs go to your central store.
Direct SSH access can be an invaluable tool for debugging production issues. It is great to be able to SSH in and check the tcp dump, view the running processes in detail, attach gdb and get a memory dump. These types of tasks will never go away. We have great leaps in orchestrating remote application management but when you get down to issues at the bottom of your stack you'll always need to directly access the machine. I like to tell my developers that "the stack goes all the way down". A bug in Linux networking or even a CPU error is still a bug.
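Concretely, the kind of one-off session being described looks something like this (the service name and port are placeholders):

```sh
# Capture traffic for the suspect port without shipping packets anywhere
tcpdump -i any -w /tmp/suspect.pcap port 8080

# See exactly which files and sockets the process is holding open
lsof -p "$(pidof myservice)"

# Sample where the process is actually spending CPU time
perf top -p "$(pidof myservice)"

# Attach a debugger to the live process, grab a core for offline analysis, detach
gdb -p "$(pidof myservice)" -ex "generate-core-file /tmp/myservice.core" -ex detach -ex quit
```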
Yep, sometimes you have to debug things live with tools like perf/gdb or `echo c > /proc/sysrq-trigger`.
If not, well, congratulations, but if I were told I couldn't use the above to debug things, I'd start questioning why this "pet" infrastructure lacks basic debugging ability.
Smart health checks and logging should take care of that and remove the instance automatically. You can also spin up a canary machine to "live" debug. I'm referring to distributed clusters not a single machine taking on all the traffic.
> Smart health checks and logging should take care of that and remove the instance automatically.
What if the bug is corrupting data or sending incorrect results to the client? Even if you detect and kill the instance, you still have a problem to fix. And even if you can cleanly kill the instance and redirect the request, you can't avoid the latency hit from having to re-process it.
> You can also spin up a canary machine to "live" debug.
You can reproduce the code and data, but how can you reproduce the exact state? You can't log everything that happens in a machine.
This is a bit naive. If you've run a production system with any decent traffic and never needed to SSH into machines, congrats. I haven't, and I don't know anyone who has. You might need to go in for anything from auditing to troubleshooting, even if it's rare.
PS: How is your automated provisioning system reaching your cluster if not by SSH?
SaltStack is either using SSH to communicate or opening its own port. I'd much rather trust an open ssh port for secure provisioning/management than allow any other piece of software to keep a port open (up to and including TLS-based protocols).
SaltStack has an agent that communicates with a master on a different server. The agents on the clients don't need an open port (other than egress).
This allows me to have one central server that is well secured and protected that allows ingress from the remote hosts, and then all the clients reach out to the master to get their tasks.
logging and metrics are sent elsewhere to be consumed and queried.
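For reference, the minion side of that is just a pointer at the master; the hostname and id below are placeholders, and the ports are the standard SaltStack ZeroMQ ones:

```yaml
# /etc/salt/minion -- placeholder master hostname and minion id
master: salt.example.internal
id: web-042

# No inbound ports are opened on the minion itself; it keeps outbound ZeroMQ
# connections to the master on TCP 4505 (publish) and 4506 (job returns).
```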
I build machine images with packer that get provisioned during the deployment pipeline. That single machine is then put into a cluster with x number of copies. If one dies I don't care, the cluster provisions another automatically.
>PS: How is your automated provisioning system reaching your cluster if not by SSH?
Not sure about moondev, but Terraform + Cloud-init + Container Orchestrator means that I basically never need to SSH into my nodes, except in extreme/rare circumstances.
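To make that concrete, the cloud-init part is typically just a small cloud-config passed as user data; the package and unit below are illustrative, not a recommendation:

```yaml
#cloud-config
# Illustrative user data: install the container runtime at first boot and
# never touch the node by hand afterwards.
package_update: true
packages:
  - docker.io
runcmd:
  - [systemctl, enable, --now, docker]
```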
I said "I basically never need to". Not that I never need to. But yeah, I basically never, ever need to. Short of needing to take a coredump or docker shits the bed, I don't really ever need to log into my nodes.
I guess that's an offensive thing to point out judging by my score...
Let's say you have a thousand machines sitting around, doing whatever, and suddenly you notice one of them acting strangely. Maybe it's just a little slower than the others, maybe it's crashing, maybe your monitoring system can't pick it up, maybe it's even producing incorrect outputs. How do you tell if it's:
A) A hardware issue (thus requiring hardware maintenance)
B) A software issue triggered by the particular workload of this machine (thus requiring a software change)
C) A network issue (thus requiring network maintenance, possibly phone calls to ISPs, etc)
D) Something else (thus requiring who knows what)
The most normal way in the real world is, have a sysadmin and a dev sit down together, SSH into the machine, and poke things until they arrive at the root of the problem (possibly with additional process involved if it's a really big system).
Then, through normal (non SSH) processes, perform the required maintenance and update your monitoring system so that it identifies problems of that class.
E) Your health checks notice the machine is acting up and it is removed from the cluster and another provisioned. In a distributed environment all of your nodes should be designed to be immutable and stateless. That's the advantage of running them at scale horizontally
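In Kubernetes terms (one possible stand-in for "smart health checks"; the path and port are assumptions), that's roughly a liveness probe plus a replica count, and the scheduler handles the replacement:

```yaml
# Fragment of a container spec: if /healthz fails three times in a row, the
# kubelet kills the container and the Deployment's replica count brings a
# replacement up, possibly on another node.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```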
So your plan is to buy a new machine as a result of a software issue, a network issue, or another problem that has nothing to do with the hardware like a corrupted image?
When you have 100 machines, you can probably afford to do this. It would be silly, because when you only have 100 machines, most of your problems will be software or network issues. Still, you can do it and it probably won't cost you a kidney. When you have ten thousand, you can't.
And this is even assuming that your monitoring system notices the problem. All too frequently, you'll notice the problem in some other way. Monitoring systems are designed to detect the things you knew might happen, and problems similar to those. There will be blind spots, it's just unavoidable.
Then you notice you're churning through machines pretty often and start to wonder if maybe you should actually fix that bug instead of just ignoring it.
We aren't talking about using SSH to go in and fix a machine so things stay up. It's about figuring out why something is happening.
When all of your nodes are immutable and stateless, you have a really tough time doing logging, new account signup, and storing any information at all on behalf of your customers.
So not all of the nodes can be immutable and stateless.
In fact, there is no such thing as an immutable or stateless computer; it violates the theory of computing. Just a bunch of buzzword nonsense people say so that they can justify spending their entire life asking for too much money to try making the machine behave as though it is stateless.
With real, physical machines, you could have a cooling issue (which you should fix), a manufacturing defect (which you should RMA), or a networking issue (which you should fix). All of these can easily be unique to an instance, and all of them need analytical attention, especially when you're paying for all of those machines "horizontally" and "at scale".
> In fact, there is no such thing as an immutable or stateless computer
Nobody calls the computer immutable/stateless. Immutable/stateless is an architecture pattern, implying that nothing important is stored or shared between calls.
It doesn't say that you can't cache info or write log files, etc.
> Just a bunch of buzzword nonsense
Well, what's the alternative that you propose? Should my HTTP requests leave lots of stateful files lying around? Is it easier to manage deployments when every box has different files lying around depending on how old it is?
Also, remember these techniques are for people "at scale". If you only have 1 or 2 boxes, these techniques may not be useful to you, so feel free to ignore them. I've got 1000 boxes, and have seen first-hand why it's a bad idea for each box to be a unique snowflake.
Even if you have physical machines, you shouldn't put your production image on it to debug it -- you should put a debug image on it that allows SSH. If it's a hardware problem, it will be easy to find. If it's a software problem -- well, you have 1000 other boxes running that software, so you'll see it again. Just add better logging.
I mean, you can have an immutable stateless computer, but it's just called a "boolean logic circuit". You give it an input, it gives you an output, that's all.
In practice, you'll of course actually be using a computer with RAM. Most likely, it will even have a Von Neumann architecture. That leaves you open to everything from race conditions to hardware failures to cosmic rays putting a machine into a bad state.
Stateless depends on the context, it doesn't violate any theory of computing. Here are some elements of stateless systems:
* all the state needed for serving requests is transient, not something that needs to be persisted
* any particular instance can disappear without affecting the observable behavior of the system
* the system is infinitely scalable: it doesn't matter if the cluster is comprised of 1, 100 or 1000 instances
Even Netflix, the poster child for this style of managing systems, allows engineers to SSH in to machines. In fact, their system and Facebook's system for performing SSH auth were discussed on HN recently.
So Teleport always records the SSH session? Doesn't that get expensive at times? Sometimes I stream logs over SSH or use `watch` on a fast interval. It's enough that tmux and ssh take a non-trivial amount of CPU, even just on the receiving end. I just wonder if that proxy recording becomes a bottleneck.
Teleport is actually a Golang library, and that's how it's used internally at Gravitational, where session recording can be turned on or off [1] based on the use case.
The pre-built teleport daemon you see in the `tool` directory does not have a config switch for turning recording off yet; we should add it.
No support for "plugins" as far as authentication, or to elaborate a bit, how do you go about running the 'auth' component in multiple VPC and have some degree of sync? Perhaps a use case for an underlying LDAP directory, or ..?
We support OIDC connectors, so you can plug in LDAP using https://github.com/coreos/dex as one of the providers, or simply roll a new OIDC provider customized to your needs.
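A dex LDAP connector along those lines would look roughly like this; the hostnames, DNs, and attribute names are all placeholders, so check the dex docs for the exact schema:

```yaml
# Sketch of a dex config fragment bridging LDAP into OIDC for Teleport
connectors:
- type: ldap
  id: ldap
  name: "Corporate LDAP"
  config:
    host: ldap.example.internal:636
    bindDN: cn=dex,ou=services,dc=example,dc=internal
    bindPW: "change-me"   # placeholder; read from a secret in practice
    userSearch:
      baseDN: ou=People,dc=example,dc=internal
      filter: "(objectClass=person)"
      username: uid
      idAttr: uid
      emailAttr: mail
      nameAttr: cn
```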
Why do these fucking quasi-products never actually say what the fuck they DO? "<this> replaces sshd, possibly the most critical piece of software your server runs, with a magic black box".
There's nothing in the rather meagre list of suggestions at the top of the readme that can't be done already and 0 detail on how they intend to achieve it and why their dubious methods require wholesale replacement of your, did I say it's critical?, ssh daemon.
SSH is definitely not a tool I'll be replacing any time soon with the latest fly-by-night product from CADT BikeShedders Inc.
There was one very interesting feature, recording of sessions.
There is an increasing demand in large organisations to manage server sessions better.
That is to have accountability and manage access centrally.
So far I've only reviewed one solution, CyberArk; they promised to manage all our server sessions centrally with 2FA, recording of ssh and rdp, and more. But in the end we weren't motivated enough to pick them.
I haven't seen any open source offering to do this.
Of course this product would sit outside of your normal SSH server, as an additional proxying layer and not a replacement for ssh. I'm just saying that there was one very interesting point in that repo; I wouldn't replace OpenSSH lightly either.