Hacker News | adrianbr's comments

The whole idea is pretty nuts! I can imagine it being used as the dev env for teams that use SQLMesh and can thus port the SQL. Might be worth investigating with them.


That's really interesting! Could you tell me a bit more about what you're thinking? I'm not the most familiar with SQLMesh and the typical workflows there.


Perhaps similar to https://github.com/duckdb/dbt-duckdb , but with SQLMesh instead of dbt, obviously.


Ah gotcha! Do you have a use case where you'd look to remodel/transform the data between warehouses?


Not the original parent, so unsure of their use case. But I've seen the approach where some basic development can be done on DuckDB before making its way to dev/qa/prd.

Something like your project might enable grabbing data (subsets of data) from a dev environment (seed) for offline, cheap (no Snowflake warehouse cost) development/unit-testing, etc.


This makes sense, thank you!


That's really cool! Have you already seen the dlt library? It's built for very easy-to-use EL in Python. It's similarly modular and built by senior data engineers for the data team, and the sources are generators which you could probably use too (rough sketch below).

How is koheesio different from dlt? Where could they complement each other?
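To make the generator point concrete, here is a minimal sketch of what a dlt resource and pipeline look like; the pipeline, table, and data values are made up for illustration:

    import dlt

    # a resource is just a Python generator yielding dicts (or lists of dicts);
    # table name, write disposition, and rows below are placeholders
    @dlt.resource(table_name="users", write_disposition="replace")
    def users():
        yield [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bo"}]

    pipeline = dlt.pipeline(pipeline_name="demo", destination="duckdb", dataset_name="raw")
    print(pipeline.run(users()))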


congrats on the hard work and this launch!


thank you!


Ahh, good old manual fine-tuning and maintenance. We are adding data contracts for things like event ingestion where the schema needs to be strict, or cases where you know ahead of time what to expect.

Our experience comes from startups that usually do not have time to track down the knowledge and would rather go out and find/make their own. Here you definitely want evolution with alerts before curation - so load to raw, and curate from there. Picking data out of something without a schema is called "schema on read" and you can read about its shortcomings. So this is both robust and practical.

For the fine-tuning, as I mentioned, data contracts are a PR review and some tweaks away. They will be highly configurable: strict, rule-based evolution, or free evolution. Definitely use alerts for curation of evolution events!
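As a rough sketch of what such a per-resource contract setting could look like once that work lands - the argument name and mode values here are assumptions, not a released API:

    import dlt

    # hypothetical contract configuration (argument name/values assumed):
    # "evolve" = free evolution, "freeze" = strict / reject changes
    @dlt.resource(schema_contract={"tables": "evolve", "columns": "freeze"})
    def events():
        yield {"event": "signup", "user_id": 42}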


Fair enough, especially if explicit alerting is involved.

Have you considered a hybrid solution, something that generates a contract from a large corpus of data, which can then be deployed statically?

I consider "responding to change" as a somewhat different scenario from "heterogeneous but not changing". So statically generating a contract from an existing corpus supports the latter.

I could also envision some kind of graceful degradation, where you have a static contract, but you have dynamic adjustments instead of outright failures if the data does not conform to that contract.
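That graceful-degradation idea can be sketched with plain Pydantic: validate each row against the static contract and route nonconforming rows to a quarantine path instead of failing the load. The model and row values below are hypothetical:

    from pydantic import BaseModel, ValidationError

    # static contract, e.g. generated earlier from a corpus (hypothetical model)
    class Order(BaseModel):
        order_id: int
        amount: float

    conforming, quarantined = [], []
    for row in [{"order_id": 1, "amount": "9.99"}, {"order_id": "abc"}]:
        try:
            conforming.append(Order.model_validate(row).model_dump())
        except ValidationError:
            # degrade gracefully: keep the raw row for later curation instead of failing
            quarantined.append(row)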


I worked with the dlt guys on exactly that: using OpenAI functions to generate a schema based on the raw data structure. You can check that work here: https://github.com/topoteretes/PromethAI-Memory (it's in the level 1 folder).
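The general shape of that trick (a rough sketch of the idea, not the PromethAI code; the model name, prompt, and function schema are made up, shown with the current tools API) is to force the model to answer through a function call so you get structured JSON back:

    import json
    from openai import OpenAI

    client = OpenAI()

    # made-up sample record to infer a schema from
    sample_row = {"user": {"id": 7, "signup": "2023-09-01"}, "amount": "19.99"}

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": f"Infer column names and data types for this record: {json.dumps(sample_row)}",
        }],
        tools=[{
            "type": "function",
            "function": {
                "name": "emit_schema",
                "description": "Return the inferred table schema",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "columns": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string"},
                                    "data_type": {"type": "string"},
                                },
                                "required": ["name", "data_type"],
                            },
                        }
                    },
                    "required": ["columns"],
                },
            },
        }],
        tool_choice={"type": "function", "function": {"name": "emit_schema"}},
    )

    # the structured schema comes back as the function-call arguments
    schema = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
    print(schema)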


DuckDB is analytical and gained popularity with the analytics crowd. It has multiple features that make it play well with use cases in that ecosystem, such as aggregation speed, Parquet support, etc.
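For a flavour of those use cases (the file name is made up), DuckDB will aggregate straight over a Parquet file with no load step:

    import duckdb

    # aggregate directly over a parquet file; no import/load step needed
    duckdb.sql("""
        SELECT customer_id, sum(amount) AS total
        FROM 'orders.parquet'
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
    """).show()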


Thank you for the heads up!

It is unfortunate, and with three-letter acronyms this will happen.

An easy way to remember is that we are the one you can pip install and that plays well in the ecosystem.

Databricks has made some interesting choices in marketing names, ngl:
- DLT, named after the competing dbt
- renaming standards like raw/staging/prod to Bronze, Silver, Gold


I had a related problem when searching for "dlt on aws lambda"... Google thinks "dlt" is an abbreviation of "delete" and returns results accordingly


For now :) Thanks for pointing it out - and it looks like we should add an AWS Lambda guide too :)

If you want to deploy to Lambda, try asking in the Slack community; some folks there do it.

Or if you want to try it yourself, here is a similar guide that highlights some concerns when deploying on GCP Cloud Functions: https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deplo...
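For anyone curious what that looks like in practice, a minimal sketch of a serverless entry point driving a pipeline - the handler signature shown is the GCP Cloud Functions HTTP style, and the names, destination, and data are placeholders:

    import dlt

    def pipeline_handler(request):
        # placeholder names/destination; in real use the rows would come from an API call
        pipeline = dlt.pipeline(
            pipeline_name="serverless_demo",
            destination="bigquery",
            dataset_name="raw",
        )
        load_info = pipeline.run(
            [{"id": 1, "value": "hello"}],
            table_name="events",
        )
        return str(load_info)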


We hear a lot about dlt & AWS Lambda. We currently have one user working on that use case (see our Slack: https://dlthub-community.slack.com/archives/C04DQA7JJN6/p169...)


This is amazing! Figuring out website APIs has always been a huge PITA. With our dlt library project we can turn the OpenAPI spec into pipelines and have the data pushed somewhere: https://www.loom.com/share/2806b873ba1c4e0ea382eb3b4fbaf808?...


There are multiple ways to run them together - we will show a few in a demo coming out soon.

We also consider a tighter integration, like the Airflow one described here, as a possible next step: https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deplo...

We will gauge interest incrementally so as not to build plugins that don't end up being used.


For an example of prior art, you should look into Astronomer's Cosmos library to see how they integrate dbt into Airflow.


Thank you! That's the example we looked at for our dlt-airflow integration :) The dlt DAG becomes an Airflow DAG.
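The rough shape of that, sketched with plain Airflow task decorators rather than the actual dlt Airflow helper, with placeholder names throughout:

    from datetime import datetime

    import dlt
    from airflow.decorators import dag, task

    # sketch only: wraps a dlt pipeline run in an ordinary Airflow task
    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def dlt_load():
        @task
        def run_pipeline():
            pipeline = dlt.pipeline(
                pipeline_name="demo", destination="bigquery", dataset_name="raw"
            )
            pipeline.run([{"id": 1}], table_name="events")

        run_pipeline()

    dlt_load()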


Since dlt generates a schema, tracks evolution, contains lineage, and follows the data vault standard, it can easily provide metadata or lineage info to other tools.

At the same time, dlt is a pipeline-building tool first - so if people want to read metadata from somewhere and store it elsewhere, they can.

If you mean taking in metadata like we integrate with Arrow - it remains to be seen whether the community wants this or finds it useful. We will not develop plugins for collecting cobwebs, but if there are interested users we will add it to our backlog.


Thanks for the response. I also noticed there was a mention of data contracts or Pydantic to keep your data clean. Would it make sense to embed that as part of a dlt pipeline, or is the recommendation to include it as part of the transformation step?


You can use Pydantic models to define schemas and validate data (we also load instances of the models natively): https://dlthub.com/docs/general-usage/resource#define-a-sche...

We have a PR (https://github.com/dlt-hub/dlt/pull/594) about to merge that makes the above highly configurable, between evolution and hard stopping:

- you will be able to totally freeze the schema and reject bad rows
- or accept data for existing columns but not new columns
- or accept some fields based on rules
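A minimal sketch of the Pydantic route, where the model drives the table schema; the model, fields, and resource name below are made up:

    from typing import Optional

    import dlt
    from pydantic import BaseModel

    # made-up model; dlt derives the column definitions from the model's fields
    class User(BaseModel):
        id: int
        name: str
        email: Optional[str] = None

    @dlt.resource(name="users", columns=User)
    def users():
        yield {"id": 1, "name": "Ana", "email": "ana@example.com"}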


You can request a source or a feature by opening an issue on the sources/dlt repo: https://github.com/dlt-hub


Thank you for the feedback! I can see now how it could be confusing.

The reason we used ChatGPT is that it's an easy starting point - why read through examples when you can get the one you want in seconds?

Because dlt is a library, it's closer to how the language works and GPT can just use it - from our experiments, we cannot say the same about frameworks.


That example looks completely opaque to me. Not only does it obfuscate what's actually happening by using some incomplete code from a chatbot, but it also isn't actually relevant to the task at hand, which is to demonstrate your library, not to demonstrate some beginner-level API data access. Skimming over it, I couldn't tell where your library actually got involved at all; it just looked like a couple of functions to access data, followed by links to your documentation. I suggest dumping the whole thing and starting with a more coherent demo that focuses on the features of the tool you actually built, not on features of irrelevant systems.


Yes I agree. Based on what they show compared to what they say, I'm not really sure what this library actually does.

