The whole idea is pretty nuts! I can imagine it being used as the dev env for teams that use SQLMesh and can thus port the SQL. Might be worth investigating with them.
That's really interesting! Could you tell me a bit more about what you're thinking? I'm not that familiar with SQLMesh and the typical workflows there.
Not the original parent, so unsure of their use case. But I've seen the approach where some/basic development can be done on DuckDB before making its way to dev/qa/prod.
Something like your project might enable grabbing data (subsets of data) from a dev environment (seed) for offline, cheap (no SF warehouse cost) development/unit-testing, etc.
That's really cool, have you already seen the dlt library? It's built for very easy-to-use EL in Python. It's similarly modular, made by senior data engineers for the data team, and the sources are generators, which you could probably use too.
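For a sense of what that looks like in practice, here's a minimal sketch of a plain generator used as a dlt resource and loaded into local DuckDB; the `events` generator and its rows are made up for illustration:

```python
import dlt

# A dlt "resource" is just a Python generator that yields rows (dicts).
# This toy generator stands in for whatever source you already have.
@dlt.resource(name="events", write_disposition="append")
def events():
    for i in range(3):
        yield {"id": i, "payload": {"kind": "demo", "value": i * 10}}

# Load locally into DuckDB; dlt infers and evolves the schema for you.
pipeline = dlt.pipeline(
    pipeline_name="demo",
    destination="duckdb",
    dataset_name="raw",
)
info = pipeline.run(events())
print(info)
```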
How is koheesio different to dlt? Where could they complement each other?
Ahh, good old manual fine-tuning and maintenance. We are adding data contracts for things like event ingestion, where the schema needs to be strict, or for cases where you know ahead of time what to expect.
Our experience comes from startups that usually do not have time to track down the knowledge and would rather go out and find/make their own. Here you definitely want evolution with alerts before curation - so load to raw, and curate from there. Picking data out of something without a schema is called "schema on read", and you can read about its shortcomings. So this is both robust and practical.
For the fine-tuning, as I mentioned, data contracts are a PR review and some tweaks away. They will be highly configurable between strict, rule-based evolution, or free evolution. Definitely use alerts for curation of evolution events!
Fair enough, especially if explicit alerting is involved.
Have you considered a hybrid solution, something that generates a contract from a large corpus of data, which can then be deployed statically?
I consider "responding to change" as a somewhat different scenario from "heterogeneous but not changing". So statically generating a contract from an existing corpus supports the latter.
I could also envision some kind of graceful degradation, where you have a static contract, but you have dynamic adjustments instead of outright failures if the data does not conform to that contract.
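Something along these lines, again sketched with Pydantic v2; the `Order` contract and the `apply_contract` helper are hypothetical, just to show the "adjust and report instead of fail" idea:

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

# A static contract, e.g. generated offline from a corpus as above.
class Order(BaseModel):
    id: int
    amount: float
    country: str = "unknown"   # sensible default instead of a hard failure

def apply_contract(raw: dict) -> tuple[Optional[Order], list[str]]:
    """Validate a row, degrading gracefully instead of failing outright.

    Unknown fields are dropped and reported, coercible values ("7" -> 7)
    are coerced by Pydantic, and only truly unusable rows return None.
    """
    known = {k: v for k, v in raw.items() if k in Order.model_fields}
    notes = [f"dropped unknown field: {k}" for k in raw if k not in Order.model_fields]
    try:
        return Order(**known), notes
    except ValidationError as exc:
        return None, notes + [f"quarantined row: {exc.error_count()} error(s)"]

row, notes = apply_contract({"id": "7", "amount": "12.5", "surprise": True})
print(row)    # id=7 amount=12.5 country='unknown'
print(notes)  # ['dropped unknown field: surprise']
```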
I worked with the dlt guys on exactly that: using OpenAI functions to generate a schema for the data based on the raw data structure. You can check out that work here: https://github.com/topoteretes/PromethAI-Memory
It's in the level 1 folder
DuckDB is analytical and gained popularity with the analytics crowd. It has multiple features that make it play well with use cases in that ecosystem, such as aggregation speed, Parquet support, etc.
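For example, an aggregation straight over a Parquet file, no load step needed (the `orders.parquet` file and its columns are made up here):

```python
import duckdb

# Query a Parquet file directly; the aggregation runs on DuckDB's
# vectorized engine without importing the data first.
con = duckdb.connect()  # in-memory database
result = con.execute(
    """
    SELECT country, count(*) AS orders, sum(amount) AS revenue
    FROM 'orders.parquet'          -- hypothetical file path
    GROUP BY country
    ORDER BY revenue DESC
    """
).fetchall()
print(result)
```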
It is unfortunate, and with three-letter acronyms this will happen.
An easy way to remember is that we are the one you can pip install and that plays well in the ecosystem.
Databricks has an interesting choice of marketing names, ngl:
- DLT named after the competing dbt
- Renaming standards like raw/staging/prod to Bronze, Silver, Gold
Since dlt generates a schema, tracks evolution, contains lineage, and follows the data vault standard, it can easily provide metadata or lineage info to other tools.
At the same time, dlt is a pipeline building tool first - so if people want to read metadata from somewhere and store it elsewhere, they can.
If you mean taking in metadata, the way we integrate with Arrow - it remains to be seen whether the community would want this or find it useful. We will not develop plugins for collecting cobwebs, but if there are interested users we will add it to our backlog.
Thanks for the response.
I also noticed there was a mention of data contracts or Pydantic to keep your data clean. Would it make sense to embed that as part of a DLT pipeline or is the recommendation to include it as part of the transformation step?
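For concreteness, here is roughly what I mean by embedding it in the pipeline: an illustrative sketch where rows are validated with Pydantic inside a dlt resource before loading (the `Event` model and rows are invented for the example):

```python
import dlt
from pydantic import BaseModel, ValidationError

class Event(BaseModel):
    user_id: int
    action: str

@dlt.resource(name="events")
def clean_events(raw_rows):
    # Validate every row against the contract before it reaches the
    # destination; invalid rows are skipped here, but they could just as
    # well be yielded into a separate dead-letter table.
    for raw in raw_rows:
        try:
            yield Event(**raw).model_dump()
        except ValidationError:
            continue

raw = [{"user_id": 1, "action": "click"},
       {"user_id": "not-an-int", "action": None}]  # second row gets dropped
pipeline = dlt.pipeline(pipeline_name="contract_demo",
                        destination="duckdb", dataset_name="clean")
pipeline.run(clean_events(raw))
```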
We have a PR (https://github.com/dlt-hub/dlt/pull/594) that is about to merge that makes the above highly configurable, between evolution and hard stopping:
- you will be able to totally freeze schema and reject bad rows
- or accept the data for existing columns but not new columns
- or accept some fields based on rules
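For illustration, a rough sketch of what such a configuration might look like once the PR lands; the `schema_contract` argument and its values are assumptions based on the PR description above, so the final names may differ:

```python
import dlt

# NOTE: parameter name and values are assumed from the PR description,
# not the released API; treat this as a sketch of the configuration idea.
@dlt.resource(
    name="payments",
    schema_contract={
        "tables": "evolve",          # new tables may still be created
        "columns": "freeze",         # reject rows that introduce new columns
        "data_type": "discard_row",  # drop rows whose values don't fit the type
    },
)
def payments():
    yield {"id": 1, "amount": 10.0}

pipeline = dlt.pipeline(pipeline_name="strict_demo", destination="duckdb")
pipeline.run(payments())
```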
That example looks completely opaque to me. Not only does it obfuscate what's actually happening by using some incomplete code from a chatbot, but it also isn't actually relevant to the task at hand, which is to demonstrate your library, not some beginner-level API data access. Skimming over it, I couldn't tell where your library actually got involved at all; it just looked like a couple of functions to access data, followed by links to your documentation. I suggest dumping the whole thing and starting with a more coherent demo that focuses on the features of the tool you actually built, not on features of irrelevant systems.