That's basically a scaled-up version of the 'I store my files on my own computer and it's 10x cheaper than Dropbox' story.
While disk failure rates are already covered in other threads here, there is one related thing that caught my interest. A disk failure in such a setup is not just the cost of a new disk plus the replacement cost (someone has to go there and change it!). It's also the inconvenience of dealing with failing requests. OK, you are willing to lose 5% of your dataset. But are your '200 lines of code' robust enough to handle such cases? What if a disk didn't fail, but became veeeeery slow? Can your training process efficiently skip such bad objects? Do you have enough visibility to know how much data you have already lost? Is it still below 5%? And so on and so forth.
I feel like this article was written right after they built this setup and before, say, six months of usage, because I'm pretty sure their costs will turn out much higher than they calculated here, especially once they start including hidden costs like the work that has to be done on the training side.
Yes, the cost of self-hosting will most probably still be less than AWS (AWS is not cheap). But it might start to be comparable with the storage offerings of small ('neo') cloud providers if you also buy your GPUs there.
>if you're reading from, like, big Parquet files, that probably means lots of random reads
and it also usually means that you shouldn't use S3 in the first place for workloads like this, because it is usually very inefficient compared to a distributed FS. Unless you have some prefetch/cache layer, you will get both bad timings and higher costs.
But a distributed FS is far more expensive than cloud blob storage would be, and I can't imagine most workloads would need the features of a POSIX filesystem.
>They shouldn't have sold the games on Steam in countries where PSN is not available
yep, as simple as that
this would have made the situation much better. The game would have gotten lower scores and a smaller player base overall.
Doing it now, and like this, leaves the impression that they got enough money from sales and now they want to drive some traffic to PSN.
If you give clients access to your DB directly, your DB effectively becomes your API, with all the contract obligations of an API. Suddenly you don't completely control your schema: you can't freely change it, and you need to add things there for your clients only. I've seen it done multiple times and it always ends up poorly. You save some time now by removing the need to build an API, but later you end up spending much more time trying to decouple your internal representation from the schema you made public.
Absolutely correct, listen to this article's ideas with great scepticism!
The system that I'm currently responsible for made this exact decision. The database is the API, and all the consuming services dip directly into each other's data. This is all within one system with one organisation in charge, and it's an unmanageable mess. The pattern suggested here is exactly the same, but with each of the consuming services owned by different organisations, so it will only be worse.
Change in a software system is inevitable, and in order to safely manage change you need a level of abstraction between the inside of a domain and the outside, and a strictly defined API contract with the outside that you can version control.
Could you create this with a layer of stored procedures on top of database replicas as described here? Theoretically yes, but in practice no. In exactly the same way that you can theoretically service any car with only a set of mole-grips.
This is just an interface, and you have the same problems with versioning and compatibility as you do with any interface. There's no difference here between the schema/semantics of a table and the types/semantics of an API.
IME, what data pipelines do is implement versioning with namespaces/schemas/versioned tables. Clients are then free to use whatever version they like. You then have the same support/maintenance policy as you would for any software package or API.
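As a rough sketch of what that can look like in PostgreSQL (schema, table and role names here are invented for illustration, not anyone's actual setup):

    -- Internal table the owning team can change freely
    CREATE SCHEMA internal;
    CREATE TABLE internal.orders (
        id          bigint PRIMARY KEY,
        customer    bigint NOT NULL,
        total_cents bigint NOT NULL
    );

    -- Versioned, published contracts: each client pins itself to one schema
    CREATE SCHEMA api_v1;
    CREATE VIEW api_v1.orders AS
        SELECT id, customer, total_cents / 100.0 AS total
        FROM internal.orders;

    CREATE SCHEMA api_v2;
    CREATE VIEW api_v2.orders AS
        SELECT id, customer AS customer_id, total_cents
        FROM internal.orders;

    -- Consumers only get the published schemas, never internal
    CREATE ROLE reporting_client LOGIN;
    GRANT USAGE ON SCHEMA api_v1, api_v2 TO reporting_client;
    GRANT SELECT ON ALL TABLES IN SCHEMA api_v1 TO reporting_client;
    GRANT SELECT ON ALL TABLES IN SCHEMA api_v2 TO reporting_client;

Dropping a version then becomes an explicit, announced deprecation, the same way you'd retire an API version.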
You're looking at the wrong layer. If we were to go to the layer you're talking about, we'd have internal and external tables, where we could change the structure of the internal tables and then rebuild/rematerialize the external tables/views from the internal ones.
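For concreteness, a minimal sketch of that split with made-up names, where the published shape is a materialized view that gets rebuilt after the internal side changes:

    -- Internal structure changes freely...
    ALTER TABLE internal.users ADD COLUMN last_seen timestamptz;

    -- ...while the published shape is a materialized view over it
    CREATE MATERIALIZED VIEW external_api.active_users AS
        SELECT id, email, last_seen
        FROM internal.users
        WHERE last_seen > now() - interval '30 days';

    -- Rebuilt on whatever cadence the contract promises
    REFRESH MATERIALIZED VIEW external_api.active_users;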
If the external tables are views that can combine select columns from multiple tables with computed fields - maybe. In theory it’s good, in practice I’ve never seen it done well.
I do think tools to manage this stuff... basically don't exist, so I'm sympathetic to the argument that while there's mostly equivalency between data and software stacks, software stacks are way more on the rails than data stacks are. Which is to say, I have seen this stuff work well with experienced data engineers, but I think you need more experience to get the same success on the data side than you do on the software side.
Yeah, I could see that. It’s not common and the tooling is primitive. Same thing I would say about event sourcing. Great in theory, but it’s more likely to get your average team into trouble.
That’s the critical point - in theory this idea is fine.
In reality other ways of solving the same problem have a decade of industry knowledge, frameworks and tooling behind them.
Is the marginal gain from this approach being a slightly better conceptual match for a given problem than the “normal way” worth throwing away all of that and starting again for?
Definitely not in my opinion. You’ll need to spend so much effort on the tooling and lessons before you’re at the point where you can see that marginal gain appear.
> That’s the critical point - in theory this idea is fine.
I've worked on production systems where this kind of stuff worked very well. I think there's weirdly a big wall between software and data, which is a shame, because the data world has a lot to offer SWEs (I've certainly learned tons, anyway).
> In reality other ways of solving the same problem have a decade of industry knowledge, frameworks and tooling behind them.
It's pretty likely that any database you're working with is as old as or older than any software stack. Java, PHP, and MySQL were all released in '95 (Java and MySQL on the very same day, which is wild), PostgreSQL in '96. Commercial DBs are even older: SQL Server is '89, Oracle is '79, and DB2 and SQL itself are from the '70s. There's a rich history on the data side too.
> Is the marginal gain from this approach being a slightly better conceptual match for a given problem than the “normal way” worth throwing away all of that and starting again for?
The gain is pretty tremendous: you don't need an app server, or at least you only need a very thin one. Tech has probably spent billions of dollars building app servers over the last 30 years. They're hard to build and even harder to maintain. Frankly, I'm tired of stacking up huge piles of code just to transpile JSON/gRPC to SQL and back again.
> Definitely not in my opinion. You’ll need to spend so much effort on the tooling and lessons before you’re at the point where you can see that marginal gain appear.
There's a lot of tooling, it's generally just built into the DB itself. And a lot of software tools work great with DBs. You can store your schemas and query libraries in git. You can hook up your CI/CD pipeline right into your database.
I also can't recommend dbt enough [0]; it's basically the best on-ramp for SWEs into data engineering out there.
Versioned views, materialized views or procedures are the solution to this. It is common that, even internally, companies don't give access to their raw data but rather to a restricted schema containing a formatted subset of it.
Views will severely restrict the kinds of changes you might want to make in the future. For example, now you can't just move some data from your database into S3 or a REST service.
Stored procedures technically can do anything, I guess, but at that point you would be better off with traditional services, which give you more flexibility.
This sounds like a total minefield. You might get a response in the same format, but I imagine it'd be very easy to break an API user who accidentally depends on performance characteristics (for example).
AWS RDS and Aurora both support synchronous and asynchronous lambda invocations from the database. Should be used very carefully, but when you want/need a fully event-driven architecture, it's wonderful.
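For reference, it looks roughly like this on Aurora PostgreSQL with the aws_lambda extension (the function name and payload are invented; the cluster also needs an IAM role that allows invoking the function, so check the AWS docs for the full setup):

    -- Synchronous: waits for the Lambda's response
    SELECT * FROM aws_lambda.invoke(
        aws_commons.create_lambda_function_arn('order-events', 'us-east-1'),
        '{"order_id": 42, "status": "paid"}'::json
    );

    -- Asynchronous (fire-and-forget): invocation type 'Event'
    SELECT * FROM aws_lambda.invoke(
        aws_commons.create_lambda_function_arn('order-events', 'us-east-1'),
        '{"order_id": 42, "status": "paid"}'::json,
        'Event'
    );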
Postgres can run arbitrary code too, but at that point it makes more sense to create a service that acts as a database and translates SQL to whatever (people already do that).
That, however, makes the whole exercise pointless, as we are back where we started.
> Accessing a view will also be slower than accessing an “original” table since the view needs to be aggregated.
Where does it say anything needs aggregating? You can have a view that exists just for security.
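For example, something as simple as this (hypothetical table) hides sensitive columns and inactive rows without aggregating anything:

    -- Security-only view: restrict columns and rows, no aggregation involved
    CREATE VIEW public_employees AS
        SELECT id, display_name, department
        FROM employees
        WHERE is_active;          -- salary, ssn, etc. are simply never exposed

    GRANT SELECT ON public_employees TO external_reader;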
> Also, using views and stored procedures with source control is a pain. Deploying these into prod is also much more cumbersome than just normal backend code.
A lot of my backend career has essentially been Greenspun's tenth rule: any sufficiently complicated REST or GraphQL API contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQL.
SQL is a legendary language, it's powerful enough to build entire APIs out of (check out PostGraphile, PostgREST, and Hasura) but also somehow simple enough that non-technical business analysts can use it. It's definitely worth spending time on.
If your API exposes SQL views directly, you can let your API's users run arbitrary queries on any set of data you expose. Plenty of apps don't need anything beyond basic "get by id" and "search by name" endpoints, but plenty of others do. At that point, with traditional backends, you're either reimplementing SQL in your API layer or creating a new endpoint for each specific use case.
Whether it's method calls or a database schema - isn't what really matters control over what's accessible and the tools you have to support evolution?
So when you provide an API, you don't make all the functions in your code available - just carefully selected ones.
If you use the DB schema as a contract you simply do the same - you don't let people access everything, just the views/tables they need and that you can support.
Just like APIs, databases have tools that allow you to evolve - for example, maintaining views that keep a contract while changing the underlying schema.
In the end - if your schema dramatically changes, in particular with changes like a 1:1 relation becoming 1:many - it's pretty hard to stop that rippling through your entire stack, however many layers you have.
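To make the view trick concrete with an invented schema: if a user's single phone column becomes a phone_numbers table, a compatibility view can keep the old 1:1 shape alive for existing consumers - though, as said, the UI eventually has to learn about the list anyway:

    -- New 1:many structure
    CREATE TABLE phone_numbers (
        user_id    bigint  NOT NULL REFERENCES users(id),
        number     text    NOT NULL,
        is_primary boolean NOT NULL DEFAULT false
    );

    -- Compatibility view preserving the old "one phone per user" contract
    CREATE VIEW users_v1 AS
        SELECT u.id, u.name, p.number AS phone
        FROM users u
        LEFT JOIN phone_numbers p
               ON p.user_id = u.id AND p.is_primary;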
> Just like APIs, databases have tools that allow you to evolve - for example, maintaining views that keep a contract while changing the underlying schema.
What are the database tools for access logs, metrics on throughput, latency, tracing etc.? Not to mention other topics like A/B tests, shadow traffic, authorization, input validation, maintaining invariants across multiple rows or even tables...
Databases often either have no tools for this or they are not quite as good.
- Throughput/latency: pg_stat_statements [1] or Prometheus' exporter [2]
- A/B tests: aren't these frontend things? Recording which version a user got is an INSERT
- Auth: row-level security [3] and session variables
- Tracing, shadow traffic: I don't think these are relevant in a "ship your database" setup.
- Validation: check constraints [4] and triggers [5]
Maybe by some measures they're "not quite as good", but on the other hand you get them for free with PostgreSQL. I've lost count of how many bad internal versions of this stuff I've built.
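For the auth and validation points specifically, a minimal sketch of what that looks like in stock PostgreSQL (table, column, and setting names are invented):

    -- Input validation: bad rows are rejected at the source
    ALTER TABLE orders
        ADD CONSTRAINT positive_total CHECK (total_cents > 0);

    -- Authorization: each tenant only ever sees its own rows
    ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
    CREATE POLICY tenant_isolation ON orders
        USING (tenant_id = current_setting('app.tenant_id')::bigint);

    -- The connecting client (or pooler) sets its tenant per session
    SET app.tenant_id = '42';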
Honestly, there's plenty of tools out there that can do the same thing.
The important crux of the counterpoint to this article is "if you ship your database, it's now the API" and everything that comes along with that.
All the problems you _think_ you're sidestepping by not building an API, you're actually just compounding further down the line, when you need to do things to your database other than simply "adding columns to a table". :\
Edit: re-reading, the point I didn't make is that having your database be your API _is_ viable, so long as you actually treat it as an API instead of an internal data structure.
You can do impedance-matching code in a database, e.g. in stored procedures, but I think the experience is strictly worse than all the application-level tooling that's available.
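For what it's worth, here's roughly what that impedance matching looks like when it lives in the database (invented schema, PL/pgSQL); it works, but compare the ergonomics and testability to doing the same mapping in your application language:

    -- Reshape internal rows into the JSON shape clients expect
    CREATE FUNCTION api_get_order(p_id bigint)
    RETURNS json
    LANGUAGE plpgsql STABLE AS $$
    BEGIN
        RETURN (
            SELECT json_build_object(
                'id',    o.id,
                'total', o.total_cents / 100.0,
                'items', (SELECT json_agg(json_build_object('sku', i.sku, 'qty', i.qty))
                          FROM order_items i
                          WHERE i.order_id = o.id)
            )
            FROM orders o
            WHERE o.id = p_id
        );
    END;
    $$;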
Are you just talking about the expected shape of the data? The consumer of the database can handle that either in SQL or at some later layer they control.
If you are talking about my 1:1 -> 1:N problem, I'd argue that can ripple all the way through to your UI (you now need to show a list where once it was a single value, etc.) - not something you can actually fix at the API level per se.
Bottom line, the more layers of indirection, the more opportunities you have to transform - but potentially also the more layers you do have to transform if the change is so big that you can't contain it.
Let's be clear - I'd typically favour APIs, especially if I don't control the other end. But I'm saying it's about the principles of surface area and evolvability, not really whether it's an API or SQL access.
I have spent my entire, long career fighting against someone who thought this was a good idea, unpicking systems where they implemented it, or bypassing systems where it was implemented. It's a many-headed hydra that keeps recurring, but rarely have I seen it laid out as explicitly as this headline.
I have, in fact, read the article, and they are _vastly underestimating_ the importance of those downsides. For instance, I once dealt with an issue that involved adding a column to a table, which they think shouldn't be too bad, that took two actual years to resolve because of all of the infrastructure built on top of it that bound directly to the table structure.
But surely the problem is with the infrastructure that can't deal with an extra column - not the db/table itself?
If all the users of an API were bound to the shape of the data returned, but you wanted to add an extra field, you'd have exactly the same problem surely?
Sounds like the problem was too much magic in the layers above, since in the end the shape of the data returned from a query on a table is up to the client - you can control it directly with SQL. In fact, dealing with an extra column (or not) is completely trivial in SQL.
I mean, exactly, that’s why this is a bad idea. Adding a column is simple, having your DB be your API is madness. The more magic you add, the worse it gets.
> If you give clients access to your DB directly, your DB effectively becomes your API, with all the contract obligations of an API. Suddenly you don't completely control your schema: you can't freely change it, and you need to add things there for your clients only. I've seen it done multiple times and it always ends up poorly.
In a past life, I worked for a large (non-Amazon) online retailer, and "shipping the DB" was a massive boat anchor the company had to drag around for a long time. They still might be dragging it, for all I know. So much tech and infra sprang up to work around this, but at some point everything came back to the same database with countless tables and columns whose purpose no one knew, but which couldn't be changed because it might break some random team's work.
> A less obvious downside is that the contract for a database can be less strict than an API. One benefit to an API layer is that you can change the underlying database structure but still massage data to look the same to clients. When you’re shipping the raw database, that becomes more difficult. Fortunately, many database changes, such as adding columns to a table, are backwards compatible so clients don’t need to change their code. Database views are also a great way to reshape data so it stays consistent—even when the underlying tables change.
Neither solution is perfect (raw read replica vs API). Pros and Cons to both. Knowing when to use which comes down to one's needs.
My last customer used an ETL tool to orchestrate their data loads between applications, but the only out-of-the-box solution was a DB reader.
Eventually, no system could be changed without breaking another system, and the central GIS system had to be gradually phased out. This also meant that everybody had to use Oracle databases, since that was the "best supported platform".
Yeah this is my gripe with things like Firebase Realtime Database.
Don't get me wrong, the amount of time it saves is massive compared to rolling your own equivalent, but it doesn't take long before you've dug yourself a big hole that would conventionally be solved with a thin API layer.
Technically you can create different users with very precise access permissions. It might not be a good idea to provide that kind of API to the general public, but if your clients are trustworthy, it might work.
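Something along these lines (role, table and column names are just examples), which even supports column-level grants:

    -- One role per client, locked down to exactly what they should see
    CREATE ROLE client_acme LOGIN PASSWORD 'change-me';

    REVOKE ALL ON ALL TABLES IN SCHEMA public FROM client_acme;
    GRANT SELECT (id, name, status) ON projects TO client_acme;  -- column-level grant
    GRANT SELECT ON reporting.monthly_summary TO client_acme;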
You could ship the database together with a Python/JS/whatever 'client library' - and you tell your clients that they need to use your code if they want to be supported.
You just know they're going to run custom code, fck up their database and then still complain.
I'm not tooo familiar with DBs, but I know customers. They're going to present custom views to your client SDK. They're going to mirror your read-only DB into their own and implement stuff there. They're going to depend on every kind of implementation detail of your DB's specific version ("It worked with last version and YOU broke it!"). They're going to run the slowest Joins you've ever seen just to get data that belongs together anyway and that you would have written a performant resolver for.
Oh, and of course, you will need 30 client libraries. Python, Java, Swift, C++, JavaScript and 6+ versions each. Compare that to "hit our CRUD REST API with a JSON object, simply send the Authorization Bearer ey token and you're fine."
This is the worst of both worlds. Not only are you back to square one, as you spent the time to build an API (client libraries), but now, if the API is limiting, the users will find ways of accessing the SQLite db directly.
They had stored procedures in the "old days", when they figured out that direct access to the database was a bad idea, so what has changed? (I agree that a DB view is often good enough though, but they ALSO had that in the "old days"; IDK what has changed about that :-p)
The title reads like it came from an MBA biz-bro that doesn't want to do anything properly because it wastes time and costs money. FWIW, I skimmed the article.
Building an API for a new application is a pretty simple undertaking and gives you an abstraction layer between your data model and API consumers. Building a suite of tests against that API that run continuously with merges to a develop/test environment will help ensure quality. Why would anyone advise to just blatantly skip out on solid application design principles? (clicks probably)
> Building an API for a new application is a pretty simple undertaking
This is super untrue, backend engineering is pretty hard and complicated, and there aren't enough people to do it. And this is coming from someone who thinks it should be replaced with SaaS stuff like Hasura and not a manual process anymore.
> Building a suite of tests against that API that run continuously with merges to a develop/test environment will help ensure quality.
You can test your data pipelines too; we do at my job and it's a lot easier than managing thousands of lines of PyTest (or whatever) tests.
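To give one hedged example of what such a test can look like: in dbt-style SQL testing, a test is just a query that must return zero rows, so a referential-integrity check over invented tables is a single SELECT:

    -- Fails the pipeline if any order points at a customer that doesn't exist
    SELECT o.id
    FROM analytics.orders o
    LEFT JOIN analytics.customers c ON c.id = o.customer_id
    WHERE c.id IS NULL;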
> Why would anyone advise to just blatantly skip out on solid application design principles?
Because building an API takes a lot of time and money, and maintaining it takes even more. It would be cool if we didn't have to do it.
While this is technically correct, the difference is that your product won't survive without OpenAI at this point. If you need the model quality OpenAI provides, you are stuck, and your product can just disappear, because the LLM is a core building block - an irreplaceable one.
Building a product that relies 100% on a single external vendor is taking a huge risk. So many companies have been burned by this in the past that it's amazing anyone doesn't see it as a risky thing.
So I have to believe that the people making these products are intending to make as much cash as possible up front and aren't aiming for a long-term thing.
I'm in Warsaw and I see countless Zelda ads: whole walls of huge buildings are covered with them. They must have spent an insane amount of money if they are advertising it like this everywhere.
I guess it depends on what you do and what your goals are. It might not be necessary for the average developer's job (and, full disclosure, it wasn't necessary 10 years ago either). But understanding the fundamentals gives you the insight to choose the right 'interface' when you need to. You can also see it as a way to stretch your 'programmer muscles'.
After going through the list of lecture topics, I think you actually need to know most of them as a working programmer - not because they are prerequisites, but because after a couple of years in the field you will have to touch most of those topics anyway.
>Let's keep in mind that modern hardware and software is very stable, generally
Not at any significant scale. DIMMs will fail, power will go down, disks will need replacement.
It is all about risk, after all. If you are OK with a couple of hours of downtime when one of your memory sticks stops working - good for you. But generally, any large enough business won't tolerate such risks.
I'm not saying that the cloud is the answer, but I don't see any future for a single-instance solution. And if you design your system like this, you are taking on much more risk than necessary.
Their point is simple: over the last 20-30 years we had efficiency improvements of orders of magnitude, and still there was no reduction in developer jobs at all. We evolved into doing more and more high-level, high-scale work. So no, -10% of the time needed to build the same unit of functionality won't lead to a -10% reduction of the workforce. It will change the scope of said workforce.