The initial drive for this was some of the work we did on Promscale, along with some observations I made when experimenting with potential optimizations for compression. We saw that a bunch of workloads involved exporting datasets to external tools such as Prometheus and pandas, materializing them in memory there, and performing the analyses in those tools. Furthermore, the queries in the external tool were often more readable than the equivalent SQL. I'd also observed earlier, while experimenting with decompression optimizations, that materializing a dataset this way could be surprisingly efficient as long as it was small enough. This, along with some dissatisfaction with the difficulty of writing certain kinds of queries, eventually led to this experiment.
> Is this meant to be a "short-cut" to express complicated SQL queries, or is this meant to add new semantics beyond SQL?
This is a surprisingly difficult question to answer; due to Postgres's extensibility, the line around "new semantics" is a bit blurry. I believe that everything we implemented could be desugared into SQL queries, though it's possible it would need some custom functions. However, we don't actually do this yet: right now the pipelines are implemented entirely with custom functions and operators, without touching the planner at all. There ended up being a three-way tradeoff between experimentation speed, execution speed, and generality. While we could theoretically get the best performance on large datasets by adding a translation pass that expands timevector pipelines into subqueries, that leads to some difficult and brittle code, which isn't worth implementing at this stage of the experiment. In addition, it doesn't necessarily produce the best code for smaller datasets or less general workloads, since that generality comes with overheads. Since our hypothesis right now is that there's demand for tools to work with smaller datasets, we started out with the simpler implementation.
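To make that concrete, here's a rough sketch of the kind of desugaring I have in mind. The pipeline form uses the toolkit's `timevector()` aggregate and `->` pipeline operator; the expanded form is an illustrative hand-written equivalent using a subquery and window functions, not something we actually generate today, and the table/column names are made up for the example:

```sql
-- Pipeline form: build a timevector per device, then run pipeline elements over it.
SELECT device_id,
       timevector(ts, val) -> sort() -> delta() -> abs() -> sum() AS volatility
FROM measurements
GROUP BY device_id;

-- One possible hand-desugared equivalent in plain SQL: sort() becomes the ORDER BY
-- in a window, delta() becomes a ts-ordered lag(), and the terminal sum() becomes
-- an ordinary aggregate over the subquery.
SELECT device_id, sum(abs_delta) AS volatility
FROM (
    SELECT device_id,
           abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) AS abs_delta
    FROM measurements
) deltas
GROUP BY device_id;
```

The second query is what a translation pass would have to emit (and keep correct across all 70+ elements), which is where the brittleness comes from.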
> Kusto has a similar data type "series" that is also created from aggregating over some columns
Cool! I'll have to look into it at some point. I think Amazon Timestream also has something similar.
Thanks for the background. I find it fascinating that the small-data scenarios in analytics are still kind of chaotic when it comes to tooling. Full-fledged SQL queries on relations seem heavy, but they stay closer to the raw data. The timevector custom data type is something of a middle ground: each timevector is essentially a pre-aggregated time series (perhaps compressed as well), so the approach likely pays off when the task is to analyze many small time series. That said, I still feel that supporting 70+ new functions adds a lot of maintenance burden, and people can't debug or extend this set of functions because they aren't SQL. I'm wondering whether you often find that users just want an out-of-the-box solution, or whether they need the ability to tweak it or add their own domain-specific logic.