
After becoming frustrated with the difficulty of implementing reliable and transparent data quality checks, I developed a new framework called Weiser. It’s inspired by tools like Soda and Great Expectations, but built with a different philosophy: simplicity, openness, and zero lock-in.

If you’ve tried Soda, you’ve probably noticed that many of the proper checks (change over time, anomaly detection, etc.) are hidden behind their cloud product. Great Expectations, while powerful, can feel overly complex and brittle for modern analytics workflows. I wanted something in between: lightweight, expressive, and flexible enough to integrate into any analytics stack.

Weiser is config-based; you define checks in YAML, and it runs them as SQL against your data warehouse. There’s no SaaS platform, no telemetry, no signup: just a CLI tool and some opinionated YAML.

Some examples of built-in checks:

1. Row count drops compared to a historical window

2. Unexpected nulls or category values

3. Distribution shifts

4. Anomaly detection

5. Cardinality changes
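To give a flavor of the config style, a row-count check might look roughly like this (a sketch only: the field names below are my shorthand for illustration, not necessarily Weiser's exact schema; see the docs for the real syntax):

    # hypothetical Weiser-style check definitions
    checks:
      - name: orders_not_empty
        dataset: analytics.orders
        type: row_count
        condition: gt
        threshold: 0

      - name: no_null_emails
        dataset: analytics.users
        type: not_null
        dimensions: [email]

Each check is run as SQL against the warehouse, and the results are written to a destination so later runs can compare against history.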

The framework is fully open source (MIT license), and its goal is to be both human- and machine-readable. I’ve been using LLMs to help generate and refine Weiser configs, and it works surprisingly well, far better than trying to wrangle pandas or SQL directly via prompt. I already have an MCP server that works well, but it’s a pain in the ass to install in Claude Desktop, so I don’t want you to waste time doing that. Once Anthropic fixes their dxt format, I will release an MCP tool for Claude Desktop.

Currently it only supports PostgreSQL and Cube as datasources; as destinations for check results it supports Postgres and DuckDB (S3). I will add Snowflake and Databricks as datasources in the next few days. It doesn’t do orchestration; you can run it via cron, Airflow, GitHub Actions, whatever you want.
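Since there’s no orchestrator built in, scheduling can be as simple as a crontab entry invoking the CLI (the command and flag below are assumptions on my part; check the docs for the real invocation):

    # hypothetical cron entry: run all checks every morning at 6am
    0 6 * * * weiser run -c checks.yaml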

If you’ve ever duct-taped together dbt tests, SQL scripts, or ad hoc dashboards to catch data quality issues, Weiser might be helpful. I would love any feedback or ideas. It’s early days, but I’m trying to keep it clean and useful for both analysts and engineers. I’m also working on a better GUI.

GitHub: https://github.com/weiser-ai/weiser

Docs: https://weiser.ai/docs/tutorial/getting-started

I'm happy to answer questions or hear about what other folks are doing to address this problem.


> It looks very important, popular and well established, but what is it?

It's easier to explain what Cube is if we first define what a Semantic Layer (SL) is. In a few words, the SL is the abstract representation of business objects, for example: sales, users, conversion rates, etc. Cube provides the language to define the SL, an API to access it, access-control mechanisms, and a caching layer. It's important to emphasize that Cube is a stand-alone SL, decoupled from any BI visualization tool. That's the "headless" part, and I would add that it is also "feetless", since it supports multiple source DBs.

Looker, the other big name in the space, has an incentive to sell you more BigQuery usage and to lock you in with its UI; it only recently started to open up to the idea of APIs. The idea is that you have a central place where you define the SL, so you don't need to duplicate the definition in every downstream application, which can lead to errors or inconsistencies.
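To make that concrete, here is roughly what a Cube model looks like (Cube supports both YAML and JS model formats; the table, measure, and dimension names here are invented for illustration). You define "sales" once, and every downstream tool queries this definition through Cube's API instead of re-deriving it:

    cubes:
      - name: sales
        sql_table: public.orders
        measures:
          - name: total_revenue
            type: sum
            sql: amount
          - name: order_count
            type: count
        dimensions:
          - name: status
            type: string
            sql: status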

> Is it that it can perform a single query across multiple databases?

Cube allows you to join data from multiple databases at the caching layer, which is fundamentally different from a federated query engine, though from the downstream application's perspective it has the same outcome. Because it happens at the caching layer, it has inherent advantages and limitations versus federated queries.

I really like this series of articles by David Jayatillake, which goes into deeper detail:

1. https://davidsj.substack.com/p/semantic-superiority-part-1

2. https://davidsj.substack.com/p/semantic-superiority-part-2

3. https://davidsj.substack.com/p/semantic-superiority-part-3

4. https://davidsj.substack.com/p/semantic-superiority-part-4

5. https://davidsj.substack.com/p/semantic-superiority-part-5


They are using CloudFormation, which predates Kubernetes and is meant to solve the same problem. CloudFormation still uses Docker; originally it used AMIs. I've worked for a lot of Fortune 500 companies that don't use Kubernetes but do use Docker.


> Upon written notice to you, we may (or may appoint a nationally recognized certified public accountant or independent auditor to) audit your use of the Services and Mapbox Software to ensure it is in compliance with these Terms. Any audit will be conducted during regular business hours, no more than once per 12-month period and upon at least 30 days’ prior written notice (except where we have reasonable belief that a violation of these Terms has occurred or is occurring), and will not unreasonably interfere with your business activities. You will provide us with reasonable access to the relevant records and facilities.

LOL no.


For the love of god, don't use NiFi to trigger an Airflow DAG.


Can you expand? We just set this workflow up and it seems to be working fine.


NiFi is meant for stream processing and Airflow for batch processing. If your NiFi flow triggers an Airflow DAG, that means your entire process is batch processing, and you shouldn't be using NiFi in the first place. If you still want to do stream processing, then use Airflow sensors to "trigger" it instead, as in the sketch below.
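A minimal sketch of that sensor pattern (the DAG name, schedule, and readiness check are hypothetical; the point is that the DAG polls for the data to arrive instead of NiFi pushing a trigger):

    # Airflow 2.x: poll until data has landed, then run the batch step
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.python import PythonSensor


    def data_has_landed():
        # Replace with a real check, e.g. query a staging table or
        # look for the file/object the upstream process writes.
        return True


    def process_batch():
        print("processing the latest batch")


    with DAG(
        dag_id="batch_after_stream",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        wait = PythonSensor(
            task_id="wait_for_data",
            python_callable=data_has_landed,
            poke_interval=60,
            mode="reschedule",  # frees the worker slot between pokes
        )
        process = PythonOperator(
            task_id="process_batch",
            python_callable=process_batch,
        )
        wait >> process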


Where are Spain and Italy in your options?


Sorry, I didn't want to enumerate all of the minor permutations. As far as I know, Spain and Italy are also doing OPTION 3. They have no realistic post-SIP plan.


"Python is slow" argument just shows complete ignorance about the subject. (and there may be good arguments for not using python)

First of all, if you are doing "for i in range(N)" then you are already doing it wrong; for ML and data analytics you should be using NumPy's "np.arange()". NumPy's arange doesn't even run in "Python", it's implemented in C, so it may even be faster than Swift's "..<". Let me know when you can use Swift with Spark.
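A quick illustration of the difference (not a rigorous benchmark; timings vary by machine):

    import time

    import numpy as np

    N = 10_000_000

    t0 = time.perf_counter()
    total = 0
    for i in range(N):  # pure-Python loop: interpreter overhead per element
        total += i * i
    print("python loop:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    a = np.arange(N, dtype=np.int64)
    total = int(np.sum(a * a))  # the loop runs in compiled C code
    print("numpy:", time.perf_counter() - t0)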


This is actually one of the most frustrating parts about using Python. You can’t write normal Python code that performs well. Instead you have to use the NumPy DSL, which I often find unintuitive and which too often results in me needing to consult Stack Overflow. This is very frustrating because I know how I want to solve the problem, but the limitations of the language prevent me from taking the path of least resistance and just writing nested loops.
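For example (my own illustration, not the parent's code): pairwise distances read naturally as nested loops, but the fast version requires a broadcasting idiom you mostly learn from Stack Overflow:

    import numpy as np

    pts = np.random.rand(500, 2)
    n = len(pts)

    # The "path of least resistance": matches how you think about it.
    d_loop = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d_loop[i, j] = np.sqrt(((pts[i] - pts[j]) ** 2).sum())

    # The idiomatic rewrite: broadcast (n,1,2) against (1,n,2).
    diff = pts[:, None, :] - pts[None, :, :]
    d_vec = np.sqrt((diff ** 2).sum(axis=-1))

    assert np.allclose(d_loop, d_vec)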


My point is that the benchmark is deceiving; again, if you are doing data analytics or ML then you are already using numpy/pandas/scipy, so that's not a valid argument.


But it is. A good compiler could unroll my loop and rewrite it with the appropriate vector ops. But that isn’t possible with just python right now.


SinTrafico | Frontend, Full Stack, Backend, Data Engineering | Onsite Mexico City, Remote in Mexico or Latin America | http://sintrafico.com

SinTrafico is a profitable and growing Mexican startup. We are the leaders in the mobility industry in Mexico, and soon in Latin America and beyond.

We are looking for senior developers for different roles. We are open to remote work and also willing to help you relocate to Mexico City.

Stack: AWS, Python, Vue.js/React.

You should definitely apply if you have experience dealing with spatial and geographic data.

Please send me your CV at paco@sintrafico.com


But that didn't happen, because Obama understood the rule of law. You can't defend Trump's illegal actions by saying that if Obama had done the same thing, the "leftists" would be cheering.


Obama did not understand the rule of law. DACA is completely unconstitutional; the President cannot change immigration law as written by Congress. If you believe Obama had the right to create DACA, then the same broad executive power to rewrite immigration law should apply to Trump, right?


I have experience with Quartz (BofA's cloud), and deploying there is 100 times easier than on AWS; everything is automated. Imagine building a cloud service where you trust all your clients, and where everyone with the correct auth must share the same information. What I'm trying to say is that their use case actually made it simpler; that's where the savings in software came from.


So we'll hear about a BoA hack in a few years where the attacker got into the cloud somehow and then had unlimited access to all the other servers? Great. Exactly what I want from a bank. /s


Or you can look at it as a single correct auth and encryption mechanism shared company-wide, versus each individual team's intern inventing new ways to base64-encode your password. Glass half full or half empty.


Doesn't this mean one compromised account = all data compromised?


Well, that's the same with any cloud provider: one compromised account with enough access could be catastrophic. In BofA there is an AWS IAM equivalent. Also, the BofA cloud is not accessible from the Internet.

