I’ve found shared app, shared database completely workable by using Postgres’ row level security. Each row in any table is locked to a tenant by a “tenant.id” setting that must match the row’s tenant_id column. At the application level, make every request set the appropriate tenant ID at request time. You get the data “isolation” while using the simplest infrastructure setup.
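For anyone who hasn't wired this up before, here is a minimal sketch of the pattern; the invoices table, the column types, and the app.tenant_id setting name are assumptions for illustration, not the poster's actual setup:

```python
# Minimal sketch: an RLS policy keyed on a transaction-local setting, plus a
# helper that sets the tenant before running a query.
import psycopg2

SETUP_SQL = """
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
ALTER TABLE invoices FORCE ROW LEVEL SECURITY;  -- apply the policy to the table owner too

CREATE POLICY tenant_isolation ON invoices
    USING (tenant_id = current_setting('app.tenant_id')::bigint)
    WITH CHECK (tenant_id = current_setting('app.tenant_id')::bigint);
"""

def query_as_tenant(conn, tenant_id, sql, params=()):
    """Scope one transaction to a tenant, then run a query.

    set_config(..., true) is transaction-local, so a pooled connection
    cannot leak the tenant setting into the next request.
    """
    with conn, conn.cursor() as cur:
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (str(tenant_id),))
        cur.execute(sql, params)
        return cur.fetchall()
```

The application then connects as an ordinary role that is subject to RLS, so every query it runs only sees the current tenant's rows.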
There's another benefit to using a tenant_id: you can partition your tables with Postgres partitioning (a rough DDL sketch follows below). That keeps each tenant's data together and keeps queries fast.
It's also amenable to distributed processing if you use something like Citus.
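A rough illustration of the partitioning idea above; hash partitioning is shown here, though per-tenant LIST partitions work too. The table name and partition count are made up, and in a real project this DDL would live in a migration:

```python
# Hash-partitioning by tenant_id: each tenant lands in exactly one partition,
# so a tenant's rows stay physically grouped.
PARTITION_DDL = """
CREATE TABLE events (
    tenant_id  bigint NOT NULL,
    id         bigint NOT NULL,
    payload    jsonb,
    PRIMARY KEY (tenant_id, id)
) PARTITION BY HASH (tenant_id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);
"""
```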
I'm really looking for an alternative to Citus, though. Citus itself is a bit tricky to use, and the SaaS version of it is owned by Microsoft, which means Azure-only. Also, Microsoft makes the SaaS version insanely expensive. If they had Citus-like features on Amazon Aurora I'd be there in a heartbeat.
Some things were mistakes but others look like pretty fundamental flaws. Performance is a problem, database changes are a problem, and you aren't able to query across tenants.
This is essentially the key (haha) to all shared/shared scenarios, regardless of what tech they are implemented with. The challenge is that migrating from single tenant to this could be fairly impactful, depending on how you built your original solution.
We have about 6 physical DB servers, each client has their own db/schema on one of those boxes. Several thousand clients each with 10s of users per client. It brings in $15m ARR.
So, it works.
We're wanting to move off that architecture to something more future proof, but it's not our biggest pain point at this point in time.
The issue I've seen with this is that to get true security you need one DB user per tenant. That makes connection pooling difficult and presents its own security issues around limiting each tenant to only their DB user.
If you do it that way then you don't gain much security. Any SQL exploit would just need to add a SET LOCAL ROLE to break out of the tenant row level security. Any code error would (probably) still allow unauthorized access, because that error will likely also set the incorrect user.
It adds a layer of security, so it might prevent some bugs from leading to exploits. But by itself it is not enough to rely on to separate tenants.
Well, if you have SQL injection bugs then you have bigger issues to worry about - I've used this to enforce multi-tenancy at the database access level (like another poster said - preventing queries from accessing the wrong data by accident, which is far more common I think).
True, I'm just not sure that I'd trust the DB isolation once the user has SQL injection. I haven't seen a SQL injection report on a project since the PHP days (ORMs solved this for the most part), but I did see multiple instances of accidental data leaks from bugs on different projects.
It looks like you could also use SET SESSION AUTHORIZATION for this, but I haven't used it, so I don't know how it interacts with data access/pooling.
If you are running a copy of the same software for each tenant anyway, it doesn't matter much, as a SQL injection for one tenant is most likely available on all tenants.
I think for this use case security is focused on accidentally returning the wrong tenant's data (fully or partially)
Thanks, this really helped me understand how row level security can be implemented effectively to partition tenants. It probably seems an obvious idea to many, but I appreciate it nonetheless
This works and solves a lot of problems. The downside is that schema changes are cumbersome because you have to make them in many places. If you want to roll out a new feature in a shared app which depends on a schema change, it's hard to do without downtime or complicated feature flags.
Ideally, setting the tenant context happens early during request authorization, is required to get access to a database connection, and is configured outside the scope of any request business logic.
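A hedged sketch of that shape as Django-style middleware; the attribute names (request.user.tenant_id) and the current_tenant helper are assumptions, not any particular codebase:

```python
# Sketch: resolve the tenant during authorization, store it in a context
# variable, and have the data layer refuse to run without it.
import contextvars
from django.http import HttpResponseForbidden

_current_tenant = contextvars.ContextVar("current_tenant", default=None)

class TenantMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        tenant_id = getattr(getattr(request, "user", None), "tenant_id", None)
        if tenant_id is None:
            return HttpResponseForbidden("no tenant context")
        token = _current_tenant.set(tenant_id)
        try:
            return self.get_response(request)
        finally:
            # Never let the tenant leak into the next request on this worker.
            _current_tenant.reset(token)

def current_tenant():
    """Called by the data-access layer before handing out a connection."""
    tenant_id = _current_tenant.get()
    if tenant_id is None:
        raise RuntimeError("database access attempted outside a tenant context")
    return tenant_id
```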
Same. This is the most sane way, in my experience. It’s pretty easy to move a tenant out of this model and into isolation if needed (never had to do it, but I dry-ran it). It’s harder to go the other way. Deployments, total system queries for analysis, etc are all simpler with this approach.
Pretty telling that the top two comments at the moment endorse the two most diametrically opposite approaches possible: single app and single set of tables with row level security, vs totally separate apps + separate dbs entirely.
I think it really depends on the kind of application and the kind of user / customer you're dealing with. I'd probably lean towards "single" database with horizontal sharding.
Yup, we use “single tenant” as a major upsell to enterprises, and just run a kubernetes cluster for each of these, with their own maintenance / release cycles.
But we can only do that because they pay a lot of money for this stuff. Our “shared, multi-tenant” environment is an order of magnitude cheaper.
We charge about 10x for enterprise customers.
The only difference between what they get and what a normal customer get is that we deploy the enterprise customers on their own servers/network.
Considering how long and exhausting the sales process is with enterprise customers, I think the price is fair, even though the extra cost to operate that kind of customer is negligible.
I'd add that the minimum bill is $10000 for anything enterprise. That's a bare minimum to handle all the time and extra work they will incur (months of sales process, dedicated AWS instances, heavy support burden, single sign on integrations, etc...).
As soon as you hear any of these, it's the hint you're dealing with enterprise and you have to up the price tenfold.
Good question. It does, in the way that we use the same software everywhere, the same docker containers and everything.
On the other hand, it takes a lot more effort to manage releases. I guess it scales in the same way Microsoft’s enterprise business scales; hard labor on top of a scalable platform.
Why is the shared app & shared database the most complex beast? Technically it's the simplest and requires the least amount of work to set up and maintain. You can even run the app & db on the same server. The complexity of maintaining multiple app & db instances isn't worth it, at least when you're starting out.
Everything is simple when you're starting out. Over time, you'd see that users can have wildly different load patterns, which will affect how you scale and shard your database over time. It's a huge pain to manage and grow.
It's way easier to scale a shared app/db, anyone who tells you different hasn't had to do devops.
Rule No. 1: never shard by account_id; Pareto is just waiting to kick you in the ass. Always shard by whatever gives the best distribution of workload.
I've always considered data isolation as the #1 priority when it comes to tenancy. Does that mean that the sharding should defacto be done on the tenant ID?
At a certain scale you have to assume that at any given time at least one of your users is guaranteed to be doing something dumb that will cause performance issues. If you haven't implemented rate limiting or request prioritization at every layer of your system, in a fully shared environment this will break everything for everyone instead of just breaking it for the one bad actor. Having fewer shared components means having fewer ways for this to go wrong. Even multiple app instances and multiple DB instances on the same machine aren't necessarily enough if you don't do proper isolation with containers or VMs, due to noisy neighbor issues. And don't even think about sending requests from different users to the same HDD, because there is almost no way to isolate them appropriately; SSDs make this so much easier.
It's the least flexible strategy _because_ it's the simplest - as soon as a customer comes knocking requiring data isolation, you have a hard business decision to make. It's also a lot harder to extract / deal with a business critical "hot" customer in either direction (either a very performance sensitive customer key to revenue, who could really use their own hardware, or a very bad customer with outsized resource consumption).
Writing a shared-app, shared-database tenant isolation middleware for Django was one of the most interesting challenges I’ve had over the years. The library was 100% tested with Hypothesis to randomise the data, and used ULIDs to allow for better long-term tenant sharding; since ULIDs are compatible with UUIDs, they can be dumped/propagated into other systems for analysis/analytical queries. It was quite a lesson in what 100% test coverage does not actually prove, since I still had bugs at 100% coverage that took work to chase down: side effects, false positives/negatives, etc.
I lead a service that's used by about 30 airlines and did pretty much exactly the same thing: Django middleware, ULIDs, and Hypothesis for testing. I also extended the Django ORM to help restrict what sorts of queries are even possible and to improve error logging for our use cases. As an extra bit of security and logging, ICAO codes everywheeeere. It's a good thing we don't sell to non-ICAO airlines.
I can also anecdotally confirm the statement in another user's comment about 1 tenant using 60% of everything. And it's not even a large airline! This one relatively small carrier uses more of our resources than AA, UA and a few of the other top 10 global carriers combined.
Thanks for sharing your experience. Mind sharing the details of the bugs and the middleware you were writing? Is that an open-source library that I can take a peek into?
I ran into so many interesting bugs that I’ve honestly forgotten most of them, but I do remember many of them had to do with the extensive use of UUIDs and how that interacted with more powerful Django ecosystem tools, e.g. things that expose a complete JSON-API endpoint; tools that do “query building” were generally not designed to account for it. My favourite was having to selectively use thread locals to disable the query filtering layer for very specific internal queries that a few libraries used (a rough sketch of that pattern is below). I tried to get it working with context vars for future Django async support but never had the time to finish that work.
The code was written with open sourcing it in mind, actually. The amount of effort involved pushed me to go the License Zero route, specifically the Prosperity Public License https://prosperitylicense.com (legal TLDR: it’s “open”, but if you use it to make money you have to pay me something I agree to). I’m hoping to do a significant refactoring before I mark the 1.0 version. I have some ideas that may be vastly cleaner, but they really require nailing the whole thread local / async context vars lifecycle, so I’ve kind of tried to limit who might start relying on it based on how extensively tested it is.
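Not the library in question, but a rough sketch of the thread-local escape hatch described above; all names here are illustrative:

```python
# Sketch: selectively disable a tenant query filter for trusted internal
# queries, using a thread local and a context manager.
import threading
from contextlib import contextmanager

_state = threading.local()

@contextmanager
def tenant_filter_disabled():
    """Temporarily bypass tenant filtering inside the with-block."""
    previous = getattr(_state, "bypass", False)
    _state.bypass = True
    try:
        yield
    finally:
        _state.bypass = previous

def tenant_filter_active():
    return not getattr(_state, "bypass", False)

# In a custom manager/queryset, something like:
#   if tenant_filter_active():
#       qs = qs.filter(tenant_id=current_tenant())
# and the internal call site wraps itself in tenant_filter_disabled().
```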
I read multiple articles on the topic years ago, dealt with and designed multi-tenant systems, and my approach is very simple for small developer teams: separate databases, separate app instances running on separate domains for every tenant. This is the least technically complex to implement and there are very few mistakes you can make (only deployment). There are a lot of tools nowadays which can help you automate and isolate the environment for these (Docker, Ansible, whatever). Also it can be the most secure architecture of all.
How is that multitenant? That's just running a separate instance of application for each tenant.
I agree that it is a completely valid approach, especially if you can afford it, but in my opinion it is just a single tenant application that can be easily spun up. For example, what if you want to manage all of your "tenants" within an admin interface. With a separate instance for all your tenants this is not really possible without additional development or a completely separate admin application.
I agree with you here. We typically only advocate for this if the budget/time is low and the likelihood of a next tenant is not soon. Then we learn all we can on the first couple clones and then move to a real multi tenant design while also handling some tech debt we created the first time around.
What tends to happen is a first immediate client is needed along with a couple of sales demo clients. This buys time for the business to see if it’s viable before we bring on complexity.
But I agree clones of a core system isn’t multi tenancy.
For what it’s worth, I tend to tell clients we won't be sustainable after 4-5 such clones.
For people considering the clone approach, let me caution about one issue that comes up about 90% of the time: a client will ask for a custom feature and be willing to pay for it for that “one” clone. These divergences are allowed to happen because we aren’t multi-tenant yet. You need to work hard to educate during this phase so you don’t create too much work down the line when you finally consolidate and refactor.
Or move to K8S and put your low volume tenants on cluster A, and high volume tenants on cluster B.
Devops it so that you can create new tenants with a script (in fact, a script run by the signup codebase) that starts them out in cluster A - rough sketch below.
Migration to cluster B would definitely be involved if you're not using a shared database - so potentially be ready to know your potential clients up front to get them on the correct cluster ahead of time.
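A hedged sketch of that signup-driven provisioning script; the cluster context names and the tenant-template.yaml manifest are invented for illustration:

```python
# Sketch: one namespace per tenant, created in the low-volume cluster by
# the signup flow.
import subprocess

def provision_tenant(tenant: str, cluster_context: str = "cluster-a") -> None:
    namespace = f"tenant-{tenant}"
    subprocess.run(
        ["kubectl", "--context", cluster_context, "create", "namespace", namespace],
        check=True,
    )
    # Render the tenant's manifests (deployment, service, ingress) and apply them.
    with open("tenant-template.yaml") as f:
        manifest = f.read().replace("{{TENANT}}", tenant)
    subprocess.run(
        ["kubectl", "--context", cluster_context, "-n", namespace, "apply", "-f", "-"],
        input=manifest.encode(),
        check=True,
    )
```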
K8s has proven to be a massive endeavor everywhere I’ve seen it implemented.
I’ve often had teams of 2-6 to handle every technical aspect of a company with millions of visits. That includes feature development and support.
K8s seems to need a large team just to keep it running.
I've had the opposite experience with K8s and Docker in general. Where we once had pretty much always-failing deployments and software kept up to date incorrectly, K8s and Docker keep the burden of keeping a site running at an all-time low in my lifetime of development. Like going from teams to just one guy.
LoL, for the way you finished your point. Good to know your experience :)
Did Amazon merely say, "you are doing it wrong", without an explanation? In retrospect, do you think you could have done better? I'm looking to hear the lessons you learned, so maybe I don't end up making that mistake ;)
I guess the way they want you to do it in S3 is with a per-tenant key prefix and a security policy set? AWS roles and security policies are quite flexible, so I guess in theory you would get the desired isolation that way. Separate buckets are still easier, though...
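In case it helps, a sketch of what that per-tenant prefix policy tends to look like; the bucket name and the tenants/<id>/ key layout are assumptions for illustration:

```python
# Build an IAM-style policy document limiting a tenant's role to its own
# key prefix in a shared bucket.
def tenant_s3_policy(bucket: str, tenant_id: str) -> dict:
    prefix = f"tenants/{tenant_id}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object access only under the tenant's prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                # Listing restricted to the same prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
        ],
    }
```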
This strategy is really hard to make work with hundreds or thousands of tenants because of the resource overhead and scaling issues, even with virtualization. It also makes integration a hard problem if you sell a modular or component-based suite of options.
One aspect left out is upgrading: can you always distribute new features to all tenants at the same time? If a new feature requires some training or organizational change, then you need to deploy at the moment agreed on with the tenant. From this point of view, models 1 and 3 are viable.
If extensions are rare, you can keep a switch for each and separate the upgrades in model 4. However, if you gate changes behind switches, then you have to keep the old code in conditional branches of your codebase forever. High cost of ownership.
A colleague implemented a hybrid system for a SaaS product he was leading. Normally the product is a single app, single database with a column as a tenant ID discriminator, but for specific tenants he built in an option to specify a separate database. This allowed any tenant that wanted higher performance or data stored in a separate location to buy in, while most of the tenants stayed in the multitenant database.
The solution I implemented on my project (IoT) was shared apps, shared database. When we were starting the project, we decided to use a column as a discriminator and designed the system so that, for entities that need to be tenant specific, developers just extend an abstract class; the rest of the system detects when you are trying to save or load such an entity and, in those cases, applies a filter or automatically assigns the tenant ID. This means a normal developer can work just like they would on a single-tenant application. I feel this is pretty normal stuff.
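Not the poster's code, but a minimal Django-flavoured sketch of that "extend an abstract class and forget about tenancy" pattern; current_tenant() stands in for whatever request-scoped lookup the middleware provides:

```python
# Sketch only: a base class that auto-filters loads and auto-assigns the
# tenant on save.
from django.db import models

def current_tenant() -> int:
    # Placeholder: wire this to your request-scoped tenant context.
    raise NotImplementedError

class TenantManager(models.Manager):
    def get_queryset(self):
        # Every load of a tenant-specific entity is filtered automatically.
        return super().get_queryset().filter(tenant_id=current_tenant())

class TenantModel(models.Model):
    tenant_id = models.BigIntegerField(db_index=True, editable=False)
    objects = TenantManager()

    class Meta:
        abstract = True

    def save(self, *args, **kwargs):
        # Every save gets the tenant assigned automatically.
        if self.tenant_id is None:
            self.tenant_id = current_tenant()
        return super().save(*args, **kwargs)

class SensorReading(TenantModel):
    """A tenant-specific entity: developers just extend the abstract base."""
    value = models.FloatField()
```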
In practice, I've seen both shared app, shared DBs, and shared app, separate DBs. "Shared / separate DBs" is not actually so black and white. I recommend making your system configurable so you can dedicate DBs to a specific tenant (or group of tenants) if needed. Most of them probably won't need it...
The business model can be the overriding factor. If you're selling to Fortune 500 firms, they will require you to allocate them their own database and application servers. But if you're selling to small/medium business you're in more of a volume business and you can't justify the expense at that price-point.
Isn't that a different tangent? Does it matter, when the SaaS codebase is the same for all tenants? If you are dealing with a different codebase for different clients, then that is a flawed approach, in that it is error prone and not scalable as tenants increase in number.
Even big software vendors whose products auto-update, e.g. Chrome, have to support differing versions to some degree - even if it's just to achieve A/B testing, phased upgrade rollouts, etc.
Being forced to flip the switch for all of your users at the same time can make that an awfully big and scary switch you're about to touch.
Try running a single SaaS codebase for enterprise clients. Maybe you can upgrade everyone with CI on a single commit, but no line-of-business solution wants that. We have to run a very strict and explicit upgrade cycle for our apps that allows them to test extensively before committing to newer versions.
Just because you're SaaS doesn't mean your clients are...
It's not a 100% fail-proof solution, but enforcing API versions in the request helps. /latest/ is available in our preprod environment, but in production you can only call the API with an explicit version.
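A rough sketch of enforcing that rule at the app edge; the /api/v<N>/ URL layout and the ENVIRONMENT setting are assumptions, not the poster's setup:

```python
# Sketch: /latest/ only resolves outside production; everything else must
# carry an explicit version segment.
import re
from django.conf import settings
from django.http import JsonResponse

VERSIONED = re.compile(r"^/api/v\d+/")

class ApiVersionMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        path = request.path
        if path.startswith("/api/"):
            if path.startswith("/api/latest/"):
                if getattr(settings, "ENVIRONMENT", "production") == "production":
                    return JsonResponse(
                        {"error": "call the API with an explicit version"}, status=400
                    )
            elif not VERSIONED.match(path):
                return JsonResponse({"error": "missing API version"}, status=400)
        return self.get_response(request)
```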
Traditional enterprise software suffers from this problem. SaaS doesn't really. Having to deal with different versions of a codebase for different customers was hellish and I don't want to do it again.
Telling the customer that "we are SaaS and we will distribute all updates on our schedule" is really convenient for the provider. However, is that really the best service to your clients?
It's significantly better for us as a team to be able to focus on projects and improvements that benefit everyone. I've noticed the business starts to suffer when we focus on individual clients with narrow needs. Any one client only brings in 1/4000th of our revenue, so spending 1/4 of our development capacity on just something for them is a huge opportunity cost when we could be using it on something that benefits all our customers.
It was a different ballgame when I worked on on-prem software that cost millions and we only had 10 customers, with a few customers sharing the same version.
What's best to your client is that you do not break the app or redesign the UI every other Wednesday. This has little relation with the upgrade schedule.
IMO this is a great problem to raise, but is best solved by feature toggles and maintaining a single application codeline. Separate codelines are absolutely brutal to maintain.
Earlier we used to have multiple web apps with separate databases; now we use a single web app that connects to a different database and configuration based on the subdomain (a rough sketch of the routing idea is below). So far it's worked great, and having a single web app really makes development and deployment a lot easier.
We often host multiple databases in a single rds instance to reduce costs.
We get data isolation and don't have to deal with sharding, of course this works well for enterprise applications with just 50 - 100 tenants.
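A rough sketch of the subdomain-to-database routing idea, assuming a Django-style setup; the alias naming and the middleware are illustrative, not the poster's actual code:

```python
# Sketch: map the request's subdomain to a DATABASES alias via a thread
# local and a database router (listed in settings.DATABASE_ROUTERS).
import threading

_state = threading.local()

class SubdomainDatabaseMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # e.g. "acme.example.com" -> alias "tenant_acme" in settings.DATABASES
        subdomain = request.get_host().split(".")[0]
        _state.db_alias = f"tenant_{subdomain}"
        try:
            return self.get_response(request)
        finally:
            _state.db_alias = None

class TenantDatabaseRouter:
    def db_for_read(self, model, **hints):
        return getattr(_state, "db_alias", None)

    def db_for_write(self, model, **hints):
        return getattr(_state, "db_alias", None)
```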
It’s really best to design everything for a shared app, shared DB model, and to have a user/tenant hierarchy in every app. You can start with each user belonging to their own tenant. In the future you can shard the DB and/or app by tenant for scalability, or provide a dedicated instance as required for data residency or whatever.
Perhaps I missed the point, but it seems strange and artificial to me that all this article considers when discussing multi-tenant architectures are the database and the app. There's so much more that goes into actually delivering multi-tenancy in production.
Long ago I was the product manager of a complex enterprise platform which had been heavily customized for one of our large banking customers. They hosted the database on SQL Server shared clusters and much of the application backend in VMware instances running on "mainframe-grade" servers (dozens of cores, exotic high-speed storage). The hardware outlay alone was many hundreds of thousands of dollars, and we interfaced with no fewer than 5 FTEs who comprised part of the teams maintaining it. Ours was one of a few applications hosted on their stack.
Despite repeated assurances of dedicated resource provisioning committed to us, our users often reported intermittent performance issues resulting in timeouts in our app. I was the first to admit our code had lots of runway remaining for performance optimization and more elegant handling of network blips. We embarked on a concerted effort to clean it up and saw huge improvements (which happily amortized to all of our other customers), but some of the performance issues still lingered. Over and over again in meetings IT pointed their fingers at us.
Eventually we replicated the issue in their DEV environment using lots of scrubbed and sanitized data, and small armies of volunteer users. I had a quite powerful laptop for the time (several CPU cores, 32GB RAM, high-end SSD's in RAID) and during our internal testing I actually hosted an entire scaled-down version of their DEV environment on it. During a site visit, we migrated their scrubbed data to my machine and connected all their clients to it. That's right, my little laptop replaced their whole back-end. It ran a bit slower but after several hours the users reported zero timeouts. This cheeky little demonstration finally caught the attention of some higher-up VP's who pushed hard on their IT department. A week later they traced the issue to a completely unrelated application that somehow managed to monopolize a good chunk of their storage bandwidth at certain points in the day. Our application was one of their more-utilized ones, but I bet correcting this issue must also have brought some relief to their other "tenants".
I know this isn't a perfect example, but it demonstrates how architecture encompasses a whole lot more than just the DB and apps. There's underlying hardware, resource provisioning, trust boundaries, isolation and security guarantees, risk containment, management, performance, monitoring and alerting, backups, availability and redundancy, upgrade and rollback capabilities, billing, etc. When you scale up toward Heroku/AWS/Azure/Google Cloud size I imagine such concerns must be quite prominent.
We often have customers complaining that our applications run significantly slower after they receive an upgrade. Our usual procedure is to always tell them that we have not seen any performance regression internally or at other customers, so we kindly ask them to look over what other changes they have made to their environment, be it hardware or software.
If they insist that the problem is our software, we tell them that we will begin troubleshooting, but if the error is outside of our responsibility, we will bill for all the hours used. 19 out of 20 times we restore the old version and benchmark them against each other and the performance is comparable. At that point they go back and recheck something, and it turns out they allocated resources differently or another application was upgraded too.
I quite like the Wordpress Multisite model which deploys a separate set of tables for each blog in a single MySQL database. Then you can add on the HyperDB plugin which lets you create rules to split the sets of tables into different databases. This gives a lot of flexibility.
I design my products as multi-tenant (both code and database). This does not mean however that the result can not be used as if it was a bunch of single tenant instances. It is up to the client how they decide to deploy.
No mention of sharding by usage pattern, which is the usual pattern at scale, e.g. potentially partitioning the app and database differently for users with different fanout, scale, or other properties that affect scaling.
Are there any strategies for migrating from a separate db per tenant to shared db with scoped tenant_id? In this case each tenant would have overlapping primary keys.
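One common approach (not from the article): copy each tenant's rows into the shared schema, stamping tenant_id on the way in and remapping the colliding primary keys, or switch to composite (tenant_id, id) keys or UUIDs so they cannot collide. A very rough sketch with illustrative table and column names:

```python
# Sketch: copy one tenant's rows into the shared database, letting the
# shared schema assign fresh primary keys and recording the old->new id
# mapping so foreign keys can be rewritten afterwards.
import psycopg2

def migrate_tenant(tenant_id, tenant_dsn, shared_dsn):
    src = psycopg2.connect(tenant_dsn)
    dst = psycopg2.connect(shared_dsn)
    id_map = {}  # old per-tenant id -> new shared id
    with src, src.cursor() as s, dst, dst.cursor() as d:
        s.execute("SELECT id, name FROM customers")
        for old_id, name in s.fetchall():
            d.execute(
                "INSERT INTO customers (tenant_id, name) VALUES (%s, %s) RETURNING id",
                (tenant_id, name),
            )
            id_map[old_id] = d.fetchone()[0]
    return id_map
```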
I guess it depends on your customer model, but surely the keyword there is "begin"?
It all depends on your customer/business model. If you're expecting to get to 1000s of clients quickly (i.e. within 12 months), then this will be an ops nightmare, unless you have really good automation.
AFAIK, historically that hasn't happened. Sales is shit hard. Getting every single client is an art in itself. You have to be a magician to get even 5-10 clients signed up readily in one go as a start-up. And by the time you have "N" paying clients matching your metric for product-market fit, you start evolving your architecture to the next level. This is where Product and Engineering sit together to take that strategic decision on the path forward.
I worked on a system with the shared app + shared database model. At its core, we received events (5-10KB) from customers and did something with those events. In total, we were receiving 8K-10K events per second.
In terms of security and privacy, isolating an individual tenant from others wasn't so much of a concern as each tenant was a customer within the organization with the same data classification level. So from a security perspective, we were "okay".
Where this gets interesting is that one tenant would suddenly decide to push a massive volume of data. Now processing events within a specific SLA was a critical non-functional requirement with this system. So then our on-call engineers would get alerts because shared messaging queues were getting backed up since Mr. Bob had decided to give us 3-5x his typical volume.
The traffic spike from one customer, which could last from minutes to hours, would negatively impact our SLAs with the other customers. Now all the customers would be upset. ^_0
Being internal customers, they were willing to pay for the excess traffic, but we didn't really have the tooling to auto scale. Our customers also didn't want us to rate limit them. Their expectation was that when they have traffic spikes, we need to be able to deal with it.
Now – we didn't want to run extra machines that sat idle like 90% of the time. And when we had these traffic spikes, we'd see our metrics and find maxed-out CPU and memory, and even worse, we'd consume all the disk space on the machines' log volumes, filling everything up. The hosts would become zombies until someone logged in and manually freed up disk.
There were a few lessons learned:
1. Rate limit your customers (if your organization allows).
2. If your customers are adamant that in some instances each month they need to be able to send you 5x the traffic without any notice, then you can't just rate limit them and be done with it. We adopted a solution where we would let our queues back up while monitors detected the excessive CPU or memory usage and started scaling out the infrastructure. Once our monitors saw the message queues were looking normal again, they'd wait a little while and then scale back down (see the sketch after this list).
3. When you're processing from a message queue, you need to capture metrics to track which customer is sending you what volume. Otherwise, you can have metrics on the message queues themselves and have one queue per customer.
4. If it's a matter of life and death (it wasn't, but that's how one customer described it), something you can do is stop logging when disk space usage exceeds a specific amount.
5. Also – when you have a high throughput system, think very carefully about every log statement you have. What is its purpose? Does it really add value?
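To make point 2 concrete, a stripped-down sketch of the "scale out while the queue is backed up, scale back in once it's calm" loop; the thresholds and the get_queue_depth / set_worker_count helpers are invented for illustration:

```python
# Sketch: watch queue depth, double workers while badly backed up, and
# halve them again after a calm period. All numbers are placeholders.
import time

SCALE_OUT_DEPTH = 100_000     # messages
SCALE_IN_DEPTH = 5_000
CALM_PERIOD_SECONDS = 15 * 60

def autoscale_loop(get_queue_depth, set_worker_count, min_workers=4, max_workers=64):
    workers = min_workers
    calm_since = None
    while True:
        depth = get_queue_depth()
        if depth > SCALE_OUT_DEPTH and workers < max_workers:
            workers = min(max_workers, workers * 2)
            set_worker_count(workers)
            calm_since = None
        elif depth < SCALE_IN_DEPTH and workers > min_workers:
            calm_since = calm_since or time.time()
            if time.time() - calm_since > CALM_PERIOD_SECONDS:
                workers = max(min_workers, workers // 2)
                set_worker_count(workers)
                calm_since = None
        else:
            calm_since = None
        time.sleep(60)
```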
I'd advise setting up classic rate limiting on the front load balancers, something like 6k requests per minute per IP (rough app-tier sketch below).
This works really well to stop clients from doing this sort of thing in the first place. Client devs get 429 errors for spamming the shit out of your server, so they add some sleep to spread out the requests a bit. Everybody wins.
You will never hear any complaint about it, it's infinitely easier for the customer to add a sleep than to figure out how to contact your support.
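The comment above puts this at the load balancer, which is where it usually belongs; if you ever had to approximate it in the app tier, a minimal fixed-window per-IP limiter looks something like this (the 6k/minute figure is from the comment, everything else is made up):

```python
# Sketch: fixed-window counter per IP; callers should respond with HTTP 429
# when allow() returns False.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
LIMIT = 6000  # requests per window per IP

_windows = defaultdict(lambda: [0.0, 0])  # ip -> [window_start, count]

def allow(ip, now=None):
    now = time.time() if now is None else now
    window_start, count = _windows[ip]
    if now - window_start >= WINDOW_SECONDS:
        _windows[ip] = [now, 1]
        return True
    if count >= LIMIT:
        return False
    _windows[ip][1] = count + 1
    return True
```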