Zvez's comments | Hacker News

That's basically a scaled-up version of the 'I store my files on my own computer and it's 10x cheaper than Dropbox' story.

While disk failure rates have already been explored in other threads here, one related thing caught my interest. A disk failure in such a setup is not just the cost of a new disk plus the replacement cost (someone has to go there and change it!). It's also the inconvenience of dealing with failing requests. OK, you are willing to lose 5% of your dataset. But are your '200 lines of code' robust enough to handle such cases? What if a disk doesn't fail outright, but starts to be veeeeery slow? Can your training process efficiently skip such bad objects? Do you have enough transparency to understand how much data you have already lost? Is it still below 5%? And so on and so forth.

I feel like this article was written right after they built this setup and before, say, six months of usage, because I'm pretty sure their costs will end up much higher than calculated here, especially once they start including hidden costs, like the work needed on the training side.

Yes, the cost of self-hosting will most probably still be less than AWS (AWS is not cheap). But it might start to be comparable with the storage offerings of small ('neo') cloud providers if you buy GPUs there.


calling everything 'for AI' is the new standard

>if you're reading from, like, big Parquet files, that probably means lots of random reads

and it also usually means that you shouldn't use S3 in the first place for workloads like this, because it is usually very inefficient compared to a distributed FS. Unless you have some prefetch/cache layer, you will get both bad timings and higher costs.


But a distributed FS is far more expensive than cloud blob storage would be, and I can't imagine most workloads would need the features of a POSIX filesystem.


>They shouldn't have sold the games on Steam in countries where PSN is not available

yep, as simple as that; it would have made the situation much better. The game would have gotten lower scores and a smaller player base overall, but doing it now, and like this, leaves the feeling that they got enough money from sales and now want to drive some traffic to PSN.


If you give access to your DB directly, your DB effectively becomes your API, with all the contract obligations of an API. Suddenly you don't completely control your schema: you can't freely change it, and you need to add things there for your clients only. I've seen it done multiple times and it always ends up poorly. You save some time now by removing the need to build an API, but later you end up spending much more time trying to decouple your internal representation from the schema you made public.


Absolutely correct; treat this article's ideas with great scepticism!

The system that I'm currently responsible for made this exact decision. The database is the API, and all the consuming services dip directly into each other's data. This is all within one system with one organisation in charge, and it's an unmanageable mess. The pattern suggested here is exactly the same, but with each of the consuming services owned by different organisations, so it will only be worse.

Change in a software system is inevitable, and in order to safely manage change you need a level of abstraction between the inside of a domain and the outside, and a strictly defined API contract with the outside that you can version-control.

Could you create this with a layer of stored procedures on top of database replicas as described here? Theoretically yes, but in practice no. In exactly the same way that you can theoretically service any car with only a set of mole-grips.


This is just an interface, and you have the same problems with versioning and compatibility as you do with any interface. There's no difference here between the schema/semantics of a table and the types/semantics of an API.

IME, data pipelines implement versioning with namespaces/schemas/versioned tables. Clients are then free to use whatever version they like. You then have the same policy of support/maintenance as you would for any software package or API.


> There's no difference here between the schema/semantics of a table and the types/semantics of an API.

There is a big difference. The types of an API can be changed independently of your schema.


You're looking at the wrong layer. If we were to go to the layer you're talking about, we'd have internal and external tables, where we could change the structure of the internal tables and then rebuild/rematerialize the external tables/views from the internal ones.


If the external tables are views that can combine select columns from multiple tables with computed fields - maybe. In theory it’s good, in practice I’ve never seen it done well.


I do think tools to manage this stuff... basically don't exist, so I'm sympathetic to the argument that while there's mostly equivalency between data and software stacks, software stacks are way more on the rails than data stacks are. Which is to say, I have seen this stuff work well with experienced data engineers, but I think you need more experience to get the same success on the data side than you do on the software side.


Yeah, I could see that. It’s not common and the tooling is primitive. Same thing I would say about event sourcing. Great in theory, but it’s more likely to get your average team into trouble.


That’s the critical point - in theory this idea is fine.

In reality other ways of solving the same problem have a decade of industry knowledge, frameworks and tooling behind them.

Is the marginal gain from this approach being a slightly better conceptual match for a given problem than the “normal way” worth throwing away all of that and starting again for?

Definitely not in my opinion. You’ll need to spend so much effort on the tooling and lessons before you’re at the point where you can see that marginal gain appear.


> That’s the critical point - in theory this idea is fine.

I've worked on production systems where this kind of stuff worked very well. I think there's weirdly a big wall between software and data, which is a shame, because the data world has a lot to offer SWEs (I've certainly learned tons, anyway).

> In reality other ways of solving the same problem have a decade of industry knowledge, frameworks and tooling behind them.

It's pretty likely that any database you're working with is as old as or older than any software stack. Java, PHP, and MySQL were all released in '95 (Java and MySQL on the very same day, which is wild), PostgreSQL was '96. Commercial DBs are even older: SQL Server is '89, Oracle is '79, and DB2 and SQL itself are from the '70s. There's a rich history on the data side too.

> Is the marginal gain from this approach being a slightly better conceptual match for a given problem than the “normal way” worth throwing away all of that and starting again for?

The gain is pretty tremendous: you don't need an app server, or at least you only need a very thin one. Tech has probably spent billions of dollars building app servers over the last 30 years. They're hard to build and even harder to maintain. Frankly, I'm tired of stacking up huge piles of code just to transpile JSON/gRPC to SQL and back again.

> Definitely not in my opinion. You’ll need to spend so much effort on the tooling and lessons before you’re at the point where you can see that marginal gain appear.

There's a lot of tooling, it's generally just built into the DB itself. And a lot of software tools work great with DBs. You can store your schemas and query libraries in git. You can hook up your CI/CD pipeline right into your database.

I also can't recommend dbt enough [0]; it's basically the best on-ramp for SWEs into data engineering out there.

[0]: https://www.getdbt.com/
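
For a taste, a dbt model is just a SQL SELECT with templated references (a minimal sketch; the file path and the upstream "users" model are assumed):

    -- models/active_users.sql: dbt compiles this into a view/table and
    -- tracks the dependency on the upstream "users" model via ref()
    SELECT id, email, last_seen_at
    FROM {{ ref('users') }}
    WHERE deleted_at IS NULL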


Versioned views, materialized views or procedures are the solution to this. Even internally, companies frequently don't give access to their raw data, but rather to a restricted schema containing a formatted subset of it.
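
Concretely, something like this (a minimal PostgreSQL sketch; all schema, table, and role names are hypothetical):

    -- clients only ever see versioned schemas; internal tables stay private
    CREATE SCHEMA api_v1;
    CREATE VIEW api_v1.customers AS
        SELECT id, full_name, created_at
        FROM internal.customers;

    -- later, internal.customers splits full_name into first/last name;
    -- v1 is recreated to keep its old contract, v2 exposes the new shape
    CREATE OR REPLACE VIEW api_v1.customers AS
        SELECT id, first_name || ' ' || last_name AS full_name, created_at
        FROM internal.customers;

    CREATE SCHEMA api_v2;
    CREATE VIEW api_v2.customers AS
        SELECT id, first_name, last_name, created_at
        FROM internal.customers;

    GRANT USAGE ON SCHEMA api_v1, api_v2 TO client_role;
    GRANT SELECT ON ALL TABLES IN SCHEMA api_v1, api_v2 TO client_role;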


Views will severely restrict the kinds of changes you might want to make in the future. For example, now you can't just move some data from your database into S3 or a REST service.

Stored procedures technically can do anything, I guess, but at that point you would be better off with traditional services, which will give you more flexibility.


A view can also do anything - it could query a REST service, for example. (Not saying that this is necessarily a good idea, though...)


Is that a real thing? What DBMSs support such views?


Most of the heavy-artillery RDBMSes at least; e.g. Postgres lets you mount arbitrary HTTP resources as tables, which you can then put views over: https://wiki.postgresql.org/wiki/Foreign_data_wrappers
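
The general shape is roughly this (a sketch; "some_http_fdw" and its options are hypothetical stand-ins for one of the real wrappers listed on that wiki page):

    -- mount an HTTP resource as a foreign table, then hide it behind a view
    CREATE EXTENSION some_http_fdw;               -- hypothetical wrapper

    CREATE SERVER rates_api
        FOREIGN DATA WRAPPER some_http_fdw
        OPTIONS (base_url 'https://example.com/rates');

    CREATE FOREIGN TABLE raw_rates (currency text, rate numeric)
        SERVER rates_api;

    -- the view is the stable contract; the HTTP plumbing stays behind it
    CREATE VIEW usd_rates AS
        SELECT currency, rate FROM raw_rates WHERE currency <> 'USD';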


This sounds like a total minefield. You might get a response in the same format, but I imagine it'd be very easy to break an API user who accidentally depends on performance characteristics (for example).


AWS RDS and Aurora both support synchronous and asynchronous lambda invocations from the database. Should be used very carefully, but when you want/need a fully event-driven architecture, it's wonderful.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...
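
On RDS/Aurora PostgreSQL this looks roughly like the following (a sketch; the ARN is a placeholder, and the linked docs have the exact signatures):

    CREATE EXTENSION IF NOT EXISTS aws_lambda CASCADE;

    -- synchronous invocation: blocks until the function returns
    SELECT * FROM aws_lambda.invoke(
        aws_commons.create_lambda_function_arn(
            'arn:aws:lambda:us-east-1:123456789012:function:my-func',
            'us-east-1'),
        '{"order_id": 42}'::json);

    -- asynchronous ("Event") invocation, e.g. fired from a trigger
    SELECT * FROM aws_lambda.invoke(
        aws_commons.create_lambda_function_arn(
            'arn:aws:lambda:us-east-1:123456789012:function:my-func',
            'us-east-1'),
        '{"order_id": 42}'::json,
        'Event');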


It is a _phenomenal_ waste of money in licensing fees, but MSSQL Server can embed C# DLLs and as a result run arbitrary code via its CLR integration.


Postgres can run arbitrary code too, but at this point it just makes more sense to create a service that acts as a database and translates SQL to whatever (people already do that). This, however, makes the whole game pointless, as we are back where we started.


Of course it’s possible, but now you need more people with DB and SQL knowledge.

Also, using views and stored procedures with source control is a pain.

Deploying these into prod is also much more cumbersome than just normal backend code.

Accessing a view will also be slower than accessing an “original” table since the view needs to be aggregated.


> Accessing a view will also be slower than accessing an “original” table since the view needs to be aggregated.

Where does it say anything needs aggregating? You can have a view that exists just for security.

> Also, using views and stored procedures with source control is a pain. Deploying these into prod is also much more cumbersome than just normal backend code.

Uh? This is normal backend code.


I don't see the problem here.

Are modern developers allergic to SQL or what is the issue?


Not all devs are proficient in SQL. It's another skill that's required.


In addition, if you are using Postgres, there is PostgREST to make building an API really quick and nice.


Why would you want to develop your API in SQL rather than a traditional language?

Versioned views and materialized views are essentially API endpoints in this context, just developed in SQL instead of some sane language.


A lot of my backend career has been essentially Greenspun's 10th Law: any sufficiently complicated REST or GraphQL API contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of SQL.

SQL is a legendary language, it's powerful enough to build entire APIs out of (check out PostGraphile, PostgREST, and Hasura) but also somehow simple enough that non-technical business analysts can use it. It's definitely worth spending time on.
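
To make that concrete, here is roughly what the PostgREST flavour of this looks like (a sketch; assumes a PostgREST instance configured with db-schema = "api", and hypothetical table names):

    -- anything in the exposed schema becomes an HTTP endpoint
    CREATE SCHEMA api;
    CREATE VIEW api.products AS
        SELECT id, name, price
        FROM internal.products
        WHERE discontinued = false;

    -- PostgREST then serves it with SQL-ish filters for free, e.g.:
    --   GET /products?price=lt.100&order=name.asc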


You can let your API's users do arbitrary queries on any sets of data you expose if your API exposes SQL views directly. Plenty of apps don't need anything beyond basic "get by id" and "search by name" endpoints, but plenty others do. At that point, with traditional backends, you're either reimplementing SQL in your API layer, or creating a new endpoint for each specific usecase.


> Versioned views, materialized views or procedures are the solution to this.

Wouldn't it be far simpler to just create a service providing access to those views with something like OData?


Whether it's method calls or a database schema, isn't what really matters control over what's accessible, and the tools you have to support evolution?

So when you provide an API - you don't make all functions in your code available - just carefully selected ones.

If you use the DB schema as a contract you simply do the same - you don't let people access all functions - just the views/tables they need/you can support.

Just like API's, databases have tools to allow you to evolve - for example, maintaining views that keep a contract while changing the underlying schema.

In the end - if your schema dramatically changes - in particular changes like 1:1 relation moving to a 1:many - it's pretty hard to stop that rippling throughout your entire stack - however many layers you have.


> Just like API's, databases have tools to allow you to evolve - for example, maintaining views that keep a contract while changing the underlying schema.

What are the database tools for access logs, metrics on throughput, latency, tracing etc.? Not to mention other topics like A/B tests, shadow traffic, authorization, input validation, maintaining invariants across multiple rows or even tables...

Databases often either have no tools for this or they are not quite as good.


- Access logs: audit logging [0]

- Throughput/latency: pg_stat_statements [1] or Prometheus' exporter [2]

- A/B tests: aren't these frontend things? recording which version a user got is an INSERT

- Auth: row-level security [3] and session variables

- Tracing, shadow traffic: I don't think these are relevant in a "ship your database" setup.

- Validation: check constraints [4] and triggers [5]

Maybe by some measures they're "not quite as good", but on the other hand you get them for free with PostgreSQL. I've lost count of how many bad internal versions of this stuff I've built.

[0]: https://severalnines.com/blog/postgresql-audit-logging-best-...

[1]: https://www.postgresql.org/docs/current/pgstatstatements.htm...

[2]: https://grafana.com/oss/prometheus/exporters/postgres-export...

[3]: https://www.postgresql.org/docs/15/ddl-rowsecurity.html

[4]: https://www.postgresql.org/docs/15/ddl-constraints.html

[5]: https://www.postgresql.org/docs/15/plpgsql-trigger.html
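
As an example of the validation point, both flavours are short in PostgreSQL (a sketch; table and column names are hypothetical):

    -- declarative validation: rejected at write time, no app code involved
    ALTER TABLE orders
        ADD CONSTRAINT total_non_negative CHECK (total >= 0);

    -- invariants across rows/tables go in a trigger
    CREATE FUNCTION enforce_stock() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        IF (SELECT stock FROM products WHERE id = NEW.product_id) < NEW.qty THEN
            RAISE EXCEPTION 'not enough stock for product %', NEW.product_id;
        END IF;
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER check_stock BEFORE INSERT ON order_lines
        FOR EACH ROW EXECUTE FUNCTION enforce_stock();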


Honestly, there's plenty of tools out there that can do the same thing.

The important crux of the counterpoint to this article is "if you ship your database, it's now the API" and everything that comes along with that.

All the problems you _think_ you're sidestepping by not building an API, you're actually just compounding further down the line, when you need to do things to your database other than simply "adding columns to a table". :\

Edit: re-reading, the point I didn't make is that having your database be your API _is_ viable, so long as you actually treat it as an API instead of an internal data structure.


You can do impedance-matching code in a database, e.g. in stored procedures, but I think the experience is strictly worse than all the application-level tooling that's available.


Not sure what you mean.

Are you just talking about the expected shape of the data? The consumer of the database can handle that either in SQL or at some later layer they control.

If you are talking about my 1:1 -> 1:N problem, I'd argue that can ripple all the way through to your UI (you now need to show a list where once it was a single value, etc.) - not something you can actually fix at the API level per se.

Bottom line, the more layers of indirection, the more opportunities you have to transform - but potentially also the more layers you do have to transform if the change is so big that you can't contain it.

Let's be clear - I'd typically favour APIs, especially if I don't control the other end. But I'm saying it's about the principles of surface area and evolvability, not really whether it's an API or SQL access.


I have spent my entire, long career fighting against someone who thought this was a good idea, unpicking or bypassing systems where it was implemented. It's a many-headed hydra that keeps recurring, but rarely have I seen it laid out as explicitly as in this headline.


I guess that's what one gets for reading just the headline? TFA talks about the downsides called out in this thread explicitly.

tbf, the idea isn't that novel. Data warehouses, for instance, provide SQL as a direct API on top of the data.


I have, in fact, read the article, and they are _vastly underestimating_ the importance of those downsides. For instance, I once dealt with an issue that involved adding a column to a table (which they think shouldn't be too bad) that took two actual years to resolve, because of all the infrastructure built on top of it that bound directly to the table structure.


But surely the problem is with the infrastructure that can't deal with an extra column - not the db/table itself?

If all the users of an API were bound to the shape of the data returned, but you wanted to add an extra field, you'd have exactly the same problem surely?

Sounds like the problem was with too much magic in the layers above - as in the end the shape of the data returned from a query on a table is up to the client - you can control it directly with SQL - in fact dealing with an extra column or not is completely trivial in SQL.


I mean, exactly, that’s why this is a bad idea. Adding a column is simple, having your DB be your API is madness. The more magic you add, the worse it gets.


I think you are mixing two problems - having part of the DB exposed, and whether people then build brittle stuff on top.

My point is that if people build brittle stuff on top that's not a problem of the DB being accessible per se.

That could just as easily happen against an API.

I assume what you had was some problem with ORMs and automatically built data structures etc. - I would argue that's a problem with those, not with the DB.


> If you give access to your DB directly, your DB effectively becomes your API, with all the contract obligations of an API. Suddenly you don't completely control your schema: you can't freely change it, and you need to add things there for your clients only. I've seen it done multiple times and it always ends up poorly.

In a past life, I worked for a large (non-Amazon) online retailer, and "shipping the DB" was a massive boat anchor the company had to drag around for a long time. They still might be, for all I know. So much tech and infra sprung up to work around this, but at some point everything came back to the same database with countless tables and columns whose purpose no one knew, but which couldn't be changed because they might break some random team's work.


That's [another reason] why you use stored procedures and only call them (rather than hardcoded or ORM-generated SQL queries) in your client app code.
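
A minimal sketch of that pattern in PostgreSQL (table, function, and role names are hypothetical): clients get EXECUTE on functions, never SELECT on the tables, so the tables can be reshaped without breaking callers.

    CREATE FUNCTION get_order(p_id bigint)
    RETURNS TABLE (id bigint, status text, total numeric)
    LANGUAGE sql STABLE AS $$
        SELECT o.id, o.status, o.total
        FROM orders o
        WHERE o.id = p_id;
    $$;

    REVOKE ALL ON orders FROM client_role;
    GRANT EXECUTE ON FUNCTION get_order(bigint) TO client_role;

    -- client code only ever runs: SELECT * FROM get_order(42);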


I think this point is addressed in the article.


Came here to say this too.

From the article:

> A less obvious downside is that the contract for a database can be less strict than an API. One benefit to an API layer is that you can change the underlying database structure but still massage data to look the same to clients. When you’re shipping the raw database, that becomes more difficult. Fortunately, many database changes, such as adding columns to a table, are backwards compatible so clients don’t need to change their code. Database views are also a great way to reshape data so it stays consistent—even when the underlying tables change.

Neither solution is perfect (raw read replica vs API). Pros and Cons to both. Knowing when to use which comes down to one's needs.


This 100%.

My last customer used an ETL tool to orchestrate their data loads between applications, but the only out of the box solution was a DB-Reader.

Eventually, no system could be changed without breaking another system, and the central GIS system had to be gradually phased out. This also meant that everybody had to use Oracle databases, since this was the "best supported platform".


On the next iteration some consultancy will replace that with a bunch of microservices using a dynamic language.

When that thing fails again they will hopefully settle on a sane monolithic API.


Yeah this is my gripe with things like Firebase Realtime Database.

Don't get me wrong, the amount of time it saves is massive compared to rolling your own equivalent, but it doesn't take long before you've dug yourself a big hole that would conventionally be solved with a thin API layer.


Also, you shouldn't give direct access to your DB, for security reasons.

That's why APIs exist in the first place.


PostgreSQL 9.5 (7.5 years old) shipped row-level security [0] which solves this.

[0]: https://www.postgresql.org/docs/15/ddl-rowsecurity.html
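
A minimal RLS sketch (PostgreSQL; the table and setting names are hypothetical):

    ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

    -- each session only ever sees its own tenant's rows, even with raw SQL
    CREATE POLICY tenant_isolation ON documents
        USING (tenant_id = current_setting('app.tenant_id')::int);

    -- the app (or connection pooler) pins the tenant per session:
    -- SET app.tenant_id = '7';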


The architecture described in the article replicates the SQLite database on the page level.


Yeah but this thread became about "you need an API".


Technically you can create different users with very precise access permissions. It might not be a good idea to provide that kind of API to the general public, but if your clients are trustworthy, it might work.
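
For example, in PostgreSQL you can make the exposed surface explicit down to the column level (a sketch; role and table names are hypothetical):

    CREATE ROLE partner_acme LOGIN;   -- password/auth set up out of band
    GRANT CONNECT ON DATABASE prod TO partner_acme;
    GRANT USAGE ON SCHEMA public TO partner_acme;
    GRANT SELECT (id, name, created_at) ON customers TO partner_acme;
    -- any query touching other columns or tables fails with "permission denied"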


No clients are trustworthy.


You could ship the database together with a Python/JS/whatever 'client library', and tell your clients that they need to use your code if they want to be supported.


You just know they're going to run custom code, fck up their database and then still complain.

I'm not tooo familiar with DBs, but I know customers. They're going to present custom views to your client SDK. They're going to mirror your read-only DB into their own and implement stuff there. They're going to depend on every kind of implementation detail of your DB's specific version ("It worked with last version and YOU broke it!"). They're going to run the slowest Joins you've ever seen just to get data that belongs together anyway and that you would have written a performant resolver for.

Oh, and of course, you will need 30 client libraries. Python, Java, Swift, C++, JavaScript and 6+ versions each. Compare that to "hit our CRUD REST API with a JSON object, simply send the Authorization Bearer ey token and you're fine."


This is the worst of both worlds. Not only are you back to square one, as you spent the time to build an API (client libraries), but now, if the API is limiting, the users will find ways of accessing the SQLite db directly.


Are you assuming clients will actually upgrade the library on a regular basis?


You can use stored procedures if you want to add another abstraction layer.


They had stored procedures in the "old days", when they figured out that direct access to the database was a bad idea, so what has changed? (I agree that a DB view is often good enough, though they ALSO had that in the "old days"; IDK what has changed about that :-p)


yeah, reminds me of Meteor JS


The title reads like it came from an MBA biz-bro that doesn't want to do anything properly because it wastes time and costs money. FWIW, I skimmed the article.

Building an API for a new application is a pretty simple undertaking and gives you an abstraction layer between your data model and API consumers. Building a suite of tests against that API that run continuously with merges to a develop/test environment will help ensure quality. Why would anyone advise to just blatantly skip out on solid application design principles? (clicks probably)


The guy knows what he's talking about [0].

> Building an API for a new application is a pretty simple undertaking

This is super untrue, backend engineering is pretty hard and complicated, and there aren't enough people to do it. And this is coming from someone who thinks it should be replaced with SaaS stuff like Hasura and not a manual process anymore.

> Building a suite of tests against that API that run continuously with merges to a develop/test environment will help ensure quality.

You can test your data pipelines too; we do at my job and it's a lot easier than managing thousands of lines of PyTest (or whatever) tests.

> Why would anyone advise to just blatantly skip out on solid application design principles?

Because building an API takes a lot of time and money, and maintaining it takes even more. It would be cool if we didn't have to do it.

[0]: https://github.com/benbjohnson


While this is technically correct, the difference is that your product won't survive without OpenAI at this point. If you need the model quality OpenAI provides, you are stuck, and your product can just disappear, because the LLM is a core building block, and an irreplaceable one.


Building a product that relies 100% on a single external vendor is taking a huge risk. So many companies have been burned by this in the past that it's amazing anyone doesn't see it as a risky thing.

So I have to believe that the people making these products are intending to make as much cash as possible up front and aren't aiming for a long-term thing.


Use the facade pattern so it’s easier to swap out if needed.


>Nintendo didn't really need to do any marketing

Still, they did.

I'm in Warsaw and I see countless Zelda ads: whole walls of huge buildings are covered with them. They probably spent an insane amount of money if they advertise it like this everywhere.


I guess it depends on what you do and your goals. It might not be necessary for the average developer's job (and, full disclosure, it wasn't necessary 10 years ago either). But understanding the fundamentals gives you the insight to choose the right 'interface' when you need to. You can also see it as a way to stretch your 'programmer muscles'.

After going through the list of lecture topics, I think you actually need to know most of them as a working programmer. Not because they are prerequisites, but because after a couple of years in the field, you will have to touch most of those topics anyway.


>Let's keep in mind that modern hardware and software is very stable, generally

Not at any significant scale. DIMMs will fail, power will go down, disks will need replacement.

It is all about risk, after all. If you are OK with a couple of hours of downtime when one of the memory sticks stops working, good for you. But generally, any large enough business won't tolerate such risks.

I'm not saying that the cloud is the answer, but I don't see any future for single-instance solutions. And if you design your system like this, you are taking on much more risk than necessary.


Their point is simple: over the last 20-30 years we have had efficiency improvements of orders of magnitude. Still, we didn't see developer jobs reduced in any way. We evolved into doing more and more high-level, high-scale work. So no, -10% of the time needed to build the same unit of functionality won't lead to a -10% reduction of the workforce. It will change the scope of said workforce.


Using 'unclean' code practices will increase development costs and, more importantly, hurt the maintainability of the code.

>Virtually every distributed service built today is able to take advantage of this

most services built today would gain much more from better system design practices.

