Hey HN! Post co-author here, excited to share our new open-source PostgreSQL tool that re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed.
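Roughly, setup looks like this -- a simplified sketch, so the exact function and argument names in the repo may differ:

```sql
-- Simplified sketch of the developer experience (function/argument names may
-- differ from the current repo): you declare the embedding pipeline once,
-- like an index, and the system keeps the embeddings in sync from then on.
SELECT ai.create_vectorizer(
    'blog'::regclass,
    destination => 'blog_embeddings',
    embedding   => ai.embedding_openai('text-embedding-3-small', 768),
    chunking    => ai.chunking_recursive_character_text_splitter('contents')
);
```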
Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon.
Eager to hear your feedback and reactions. If you'd like to leave an issue or better yet a PR, you can do so here [1].
Hey, this is really cool! Thanks for the article and the tool itself.
One question - in the RAG projects we've done, most of the source data was scattered in various source systems, but wasn't necessarily imported into a single DB or Data Lake. For example, building an internal Q&A tool for a company that has knowledge stored in services like Zendesk, Google Drive, an internal company Wiki, etc.
In those cases, it made sense to not import the source documents, or only import metadata about them, and keep the embeddings in a dedicated Vector DB. This seems to me to be a fairly common use case - most enterprises have this kind of data scattered across various systems.
How do you envision this kind of use case working with this tool? I may have missed it, but you mention things like working with images, etc. -- is your assumption that everyone is storing all of that data in Postgres?
Pretty smart. Why is the DB API the abstraction layer, though? Why not two columns and a microservice? I assume you are making async calls to get the embeddings?
I say that because it seems unusual. An index would suit sync better. But async things like embeddings, geocoding an address, checking whether an email is from a spammer, etc. feel like app-level stuff.
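Concretely, something like this is what I have in mind -- a hypothetical sketch, assuming the pgvector extension and an `updated_at` column:

```sql
-- Hypothetical sketch of the alternative (not the tool's API): two extra
-- columns on the source table, plus a separate microservice that polls for
-- stale rows, calls the embedding endpoint, and writes the vector back.
ALTER TABLE blog
    ADD COLUMN embedding   vector(768),
    ADD COLUMN embedded_at timestamptz;

-- The microservice would poll something like:
SELECT id, contents
FROM blog
WHERE embedded_at IS NULL OR embedded_at < updated_at;
-- ...then UPDATE blog SET embedding = ..., embedded_at = now() per row.
```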
The DB is the right layer from an interface point of view -- because that's where the data properties should be defined. We also use the DB to keep track of what work needs to be done, because we can leverage transactions and triggers to make sure we never miss any data. From an implementation point of view, the actual embedding does happen outside the database, in a Python worker or cloud functions.
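To make the pattern concrete, here's a minimal sketch (illustrative only, not our actual internals):

```sql
-- A trigger enqueues changed rows in the same transaction as the write,
-- so the work queue can never miss data.
CREATE TABLE blog_embedding_work_queue (
    id        bigint      NOT NULL,              -- PK of the changed blog row
    queued_at timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION enqueue_embedding_work() RETURNS trigger AS $$
BEGIN
    INSERT INTO blog_embedding_work_queue (id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blog_embedding_sync
AFTER INSERT OR UPDATE ON blog
FOR EACH ROW EXECUTE FUNCTION enqueue_embedding_work();
```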
Merging the embeddings and the original data into a single view gives you the full feature set of SQL, rather than being constrained by a REST API.
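For example (again a sketch; the table and column names are made up):

```sql
-- Join source rows with their embeddings in one view, so ordinary SQL
-- (joins, filters, ORDER BY similarity) works across both.
CREATE VIEW blog_with_embeddings AS
SELECT b.id, b.title, b.contents, e.chunk, e.embedding
FROM blog b
JOIN blog_embeddings e ON e.blog_id = b.id;

-- Semantic search is then just a query; $1 is the query embedding computed
-- in the app, and <=> is pgvector's cosine-distance operator.
SELECT title, chunk
FROM blog_with_embeddings
ORDER BY embedding <=> $1
LIMIT 10;
```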
That is arguable, because while it is a calculated field, it is not a pure one (IO is required), and it is not necessarily idempotent, atomic, or guaranteed to succeed.
It is certainly convenient for the end user, but it hides things. What if the API calls to OpenAI fail or get rate limited? How is that surfaced? Will I see it in my observability stack? Will queries just silently miss results?
If the DB did the embedding itself synchronously within the write, it would make sense. That would be more like Elasticsearch or a typical full-text index.
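For contrast, this is the kind of synchronous, in-write computation I mean (standard PostgreSQL; the `blog` table and `contents` column are assumed):

```sql
-- A typical full-text index: the tsvector is computed synchronously as part
-- of the write itself (generated column, PostgreSQL 12+).
ALTER TABLE blog
    ADD COLUMN contents_tsv tsvector
        GENERATED ALWAYS AS (to_tsvector('english', contents)) STORED;

CREATE INDEX blog_contents_tsv_idx ON blog USING gin (contents_tsv);
```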
(co-author here) We automatically retry failures after a while. We also log error messages in the worker (self-hosted) and show clear indicators in the cloud UI that something went wrong (with plans to add email alerts later).
The error handling is actually the hard part here. We don't believe that failing inserts because the endpoint is down is the right approach, because that just moves the retry/error-handling logic upstream -- now you need to roll your own queuing system, backoffs, etc.
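For a flavor of why the DB is handy for this bookkeeping, here's a simplified sketch (not our actual worker SQL) of how a worker can drain the queue with at-least-once semantics:

```sql
-- Claim a batch with SKIP LOCKED so multiple workers can cooperate. If the
-- embedding call fails, roll back and the rows stay queued for a retry.
BEGIN;

WITH batch AS (
    SELECT id
    FROM blog_embedding_work_queue
    ORDER BY queued_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
DELETE FROM blog_embedding_work_queue q
USING batch
WHERE q.id = batch.id
RETURNING q.id;

-- The worker embeds the returned rows and upserts the vectors here.
-- COMMIT only once that succeeds; otherwise ROLLBACK, and the deleted
-- queue rows reappear for the next attempt (backoff lives in the worker).
COMMIT;
```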
Thanks for the reply. These are compelling points.
By the way, I also agree that the insert shouldn't fail. The insert is sort of an enqueuing action.
I was debating if a microservice should process that queue.
Since you are a PaaS, the distinction might be almost moot -- an implementation detail. (It would affect the API, though.)
However, if Postgres added this feature generally, it would seem odd to me, because it feels like the DB doing app stuff. The DB is fetching data for itself from an external source.
The advantage is that it is one less thing for the app to do, and it may handle chores many teams otherwise roll their own code for.
A downside is that if I want to change how this is done, I probably can't. Say I have data residency or security requirements that affect the data I want to encode.
I think there is much to consider. Probably the "why not both" meme applies, though: use the built-in feature if you can, and roll your own where you can't.
This question keeps popping up, but I don't get it. Everyone and their dog has an OpenAI-compatible API. Why not just serve a local LLM and put `127.0.0.1 api.openai.com` in your hosts file?
I mean, why is that even a question? Is there some fundamental difference between the black box that is GPT-* and, say, LLaMA, that I don't grok?
This is super cool! One suggestion for the blog: I would put "re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed." as a tl;dr at the top.
It wasn't clear to me why this was significantly different from using pgvector until I read that. That makes the rest of the post (e.g., why you need the custom methods in a `SELECT`) make a lot more sense in context.
[1]: https://github.com/timescale/pgai