SQL for data scientists in 100 queries (gvwilson.github.io)
834 points by Anon84 on Feb 6, 2024 | 158 comments



Not to detract from the article, but: wow, the meaning of the term "data scientist" has changed since the days of the "sexiest job". From the article description:

- Rachel has a master’s degree in cell biology and now works in a research hospital doing cell assays.

- She learned a bit of R in an undergrad biostatistics course and has been through the Carpentries lesson on the Unix shell.

- Rachel is thinking about becoming a data scientist and would like to understand how data is stored and managed.

Data Scientists, back in the day, were largely people with both a fairly strong quantitative background and a strong software engineering background. The kind of people who could build a demo LSTM in an afternoon. Usually there was a bit of a trade-off between the quant/software aspects (really mathly people might be worse coders, really strong coders might need to freshen up on a few areas of mathematics), but generally they were fairly strong in each area.

In many orgs it's been reduced to "overpaid data analyst", but I wouldn't even hire "Rachel" for a role like that.


No, it hasn't.

The term has always gone in a half-dozen directions at once, ranging from

* an idiot making PPT decks for business presentations based on sales data; to

* a statistician with a very sophisticated mathematical background but minimal programming skills doing things in R or Stata; to

* a person with a random degree making random dashboard in Tableau; to

* a person with a sophisticated background in software engineering, data engineering, and related fields who can kind of do math; to

* an expert in machine learning (of various calibers); to

* a physicist using their quantitative skills to munge data

... and so on. That's been confusing people since the title came out. It depends on the industry, and there are a dozen overlapping titles too, some with well-defined meanings and some varying from company to company (business analyst, data engineer, etc.).


This is so true. Outside of maybe FAANG companies, a lot of places have wildly different expectations for that role. While one company may refer to the guy doing simple PPTs as a business analyst, others might call that a data analyst or a data scientist or something else. The pay probably reflects the truth, though, outside of exceptions created by office politics.


Relatedly (I'm a lead data engineer at my current $JOB), I've yet to find a definition of what a data engineer is/does that's easy to share with people. Of course I have flippant ones (YAML dev with a bit of Python), but nothing better than: database admins who learned Python and now care about more of the data lifecycle than just the data residing in the DBs they manage.


As a data engineer do you find it your job to transform and clean data? How much AI stuff do you implement that does data transformations?


Not a data engineer per se, but I do a lot of data stuff.

I find AI has revolutionized anything one-off, and data stuff has a lot of one-offs. 80% of the time, I can ask the LLM and get a working solution that would otherwise take 1-2 hours to code.

* Verification of correctness is unnecessary or nominal, since it only needs to work in one case. The complex corner cases that come up in software systems simply don't exist here. And in most cases, the code is simple enough to verify at a glance.

* Code quality doesn't matter since it's throw-away code.

It's something along the lines of:

"I have:

[cut-and-paste some XML with a hairy nested JSON structure embedded]

I want:

[write three columns with the data format I want, e.g. 3 columns of CSV with only the data I need]"

Can you make a Python script to do that?

[Cut-and-paste script, and see if it works]

If it does, I'm done. If it doesn't, I can ask again, break it into simpler steps, ask it to debug, or do it by hand. Almost no time lost up to this point, though, and 80% of the time, I just saved two hours.

In practice, this means I can do a lot more prototyping and preliminary analysis, so I get to better results. Deadlines, commitments, and working time have not changed, so the net result is much higher quality output, holistically.


I think I need to revisit my hesitation to use LLMs. I think it stems from stubbornness. I'd rather write the boilerplate and the transformation code myself, or do it in SQL in DuckDB, but if the tool can do it well enough, so be it.

The bit about one-offs doesn't match my experience: connector code, extract logic, and even data cleaning change based on the source and usually end up in production.

Ideal would be an endpoint I could send sample data to, like your example, which would return the needed code after a prompt, or bypass the code entirely and just give me the subset of data I request.


This discussion started with what a data engineer does, and the diversity of roles. I wasn't trying to push my workflow on anyone. With what I'm doing right now (which includes data work and SWE; I'm describing it holistically here), there is a flow:

---

Step 1:

- I do a lot of exploratory and one-off analysis, some of which leads to internal memos and similar, and some of which goes nowhere. I do a lot of prototyping too.

- I do a lot of whiteboarding with stakeholders. This is also open-ended and exploratory. I might have a hundred mock-ups before I build something which would go into prod (which isn't a lot of time; a mock-up might be 5 minutes, so a hundred represents a few days' time).

This helps make sure (1) I have enough flexibility in my architecture to cover likely use-cases without overengineering for things which will never happen, and (2) I pick the right set of things to build.

---

Step 2:

I build high-fidelity versions of the above. These, I can review e.g. with focus groups, in 1:1s, and in meetings.

---

Step 3:

I build production-ready deployable code. Probably about a third of the things in step 2 reach step 3.

---

LLMs do relatively little for step 3. If I have time, I'll have GPT do a code review. It's sometimes helpful. It sounds like you spend most of your time here, so you might get less benefit than I do.

For step 2, they can often build my high-fidelity mockup for me, which is nice. What they can't do yet is do so in a way which is consistent with the rest of my codebase (front-end theming, code style, tools used, etc.). I'll get something working end-to-end quickly, but not necessarily something I can leverage directly for step 3.

However, in step 1, they've had a transformational impact. Exploratory work is 99% throw-away code (even the stuff which eventually makes it to prod has, by that point, been cleanly rewritten).

One more change is that in step 1, I can try different libraries and tools too. LLMs are at the level of a very junior programmer, which is a lot better than me in a tool I've never used. Evaluating e.g. a library might be a couple of days of learning plus the equivalent of 5 minutes to 1 day of building (usually, to figure out it's useless for my use-case). With an LLM, I have a feasible lousy first version in minutes. This means I can try a half-dozen libraries in an hour. That didn't fit into my timelines pre-LLM, and definitely does now. I end up using better libraries in my code, which leads to better architecture.

So YMMV.

I'm posting since I like reading stories like the above myself. Contexts vary, and it's helpful to see how things are done in contexts others than my own. If others have them, please feel free to share too.


Hmm. I'm pretty convinced. I often get mired in steps 1 and 2. Partly because of my ADHD, but also because of the tedium of writing code.

Which LLM are you using? ChatGPT Enterprise? Something offline and data/SQL-centric?


ChatGPT a plurality of the time; I go through the API with a little script I wrote. If it doesn't work, I step up to GPT-4. The costs are nominal (<$3/month). ChatGPT has gotten worse over time, so the need to escalate is more frequent; when it first came out, it was excellent.

API (rather than web) is more convenient and avoids a lot of privacy / data security issues. I wouldn't use it for highly secure things, but most of what I do is open-source, or just isn't that special.

I have analogous scripts to run various local LLMs, but with my setup, the init / cooldown time is long enough that it's easier to use a web API. Plus my GPU is often otherwise occupied. Most of what I use my GPU for are text (not code) tasks, and I find the open source models are good enough. I've heard worse things about them for code, but I haven't experimented enough to see if they'd be adequate. Some of that is getting used to how the system works, good / bad prompts, etc.

ollama + a second GPU + a running chat process would likely solve the problem for ≈$2k, about the equivalent of a half-century of my OpenAI API calls at current rates. If I were dealing with something secure, that'd probably make sense. For what I'm doing now, it doesn't.


That's a good question. I think LLMs will have a place in the connector space. It would be really cool if they could dynamically handle changes in the source (the API changed and added some new data, new columns, etc.). But right now I, at least, don't trust AI to do much of anything in terms of ingestion. When data is extracted from the source it's got to be as close to a 1:1 copy of the source as possible. Any errors introduced will have a snowball effect down the line.

For data cleaning we do tend to write the same sort of things over and over. And that's where I think things could improve. Though what makes a data engineer special in my mind is that they get to know the nuances of data in detail. They get familiar with the columns and their meanings to the business and the expected volume and all sorts of things. And when you get that deeply involved with the data, you clearly see where things are jarring, and, almost like a vet with a sick animal, you write data-cleaning code because you care about the data that much.


As a Data Engineer I find my job is about finding, managing, and deleting PII data.


Hah. If you have found a systematic way of doing this, please share.


Reach out to me in six months, and I should have one up on github if I'm lucky :)

(I've had a design in the works for years, and finally should have time and budget to implement it. Probably not helpful for legacy systems, though.)


I hope to see this as a Show HN! Let me put something in the calendar ;-)

There’s no email or other such thing in your bio. :(


PII data innit


A joke I read shortly after the term Data Scientist was introduced:

Data Scientist - a statistics major living in San Francisco


Let's face it, job titles are a lot of bullshit. I was a "programmer". I call myself a "software engineer". I probably write better SQL than many data engineers/scientists, which is getting annoying as I am shoehorned into roles where I plug APIs together rather than deal with SQL. But data engineer roles always want a load of stuff I have never needed to deal with.


Data scientists with a strong software engineering background, where are they hiding?

Jokes aside, there used to be two categories of data scientists: those who came from a science/PhD background and duct-taped their mathematical understanding to code which might work in production, and those who came from a CS background and duct-taped their mathematical/Medium-tutorial knowledge to an extravaganza of grid search and micro-services that made unscientific predictions in a scalable way.

So now we have the ML engineer (engineering) and the data scientist (science) with clear roles and expectations. Both are full-time jobs; most people cannot do both.


all 5 of them are at Alphabet/Meta/OpenAI, no?

but more seriously, unless someone's explicitly doing ML research, for most applications using something off-the-shelf-ish[0] and tinkering with it works best. and this mostly requires direct experience[1] with the stack.

and sure, of course, if said project/team/org/corp has so much money they even can train their own model, sure, they can then afford to have these separate roles with "more dedicated" domain experts.

[0] from YOLO to LLaMa to whatever's now on HuggingFace

[1] the more direct the better. you have used LLMs before? great. pyTorch? great. you can deploy stuff on k8s and played with ChatGPT? well, okay, that's ... also great. you know how to get stuff from Snowflake/Databricks/SQL to some training job? take my money!


The term sharded into multiple different terms

Strong coder who can implement an LSTM = ML Engineer

Decent coder who can implement a recent paper with scaffolding code = Applied Scientist

Acceptable coder who is good enough at math to innovate and publish = Research Scientist

Strong coder who cares about data = Data Engineer

Acceptable coder who has lots of domain knowledge = Business analyst, Data Scientist.

If you're just a Data scientist without any domain knowledge...... then you're in a precarious career position.


I've seen a disturbing rise in the number of people who think data engineering isn't software engineering. I don't plan to play up that part of my experience the next time I'm applying.


It's because data engineering has been reduced to being able to log in to a cloud provider and knowing which workflow to drag and drop. This is easily learned in a couple of weeks, so that's why those skills might not be considered software engineering.


Well, that's the GP's point, I guess: this thing was called "business analyst", and, honestly, I don't know what being a domain expert with somewhat above-average computer skills has to do with "data science".


I guess I'm a data scientist then, that sounds better than business analyst.



> people with both a fairly strong quantitative background and a strong software engineering background.

In my experience this intersection is a null set. Not only is it an extremely rare feat to pull off; IMO the mental bandwidth and time needed to be good at even one of those two would consume one person fully. This is why quant/stat specialists were paired with ETL/data-pipeline specialists to build end-to-end solutions.

One reason Data Science became such a hot role back in the day was that it was amorphously defined; because no one knew exactly what it entailed, folks across a broad range of skill sets (stats, data engineering, NoSQL, visualisation, and so on) jumped into the fray. But now that companies have burnt their hands, they have learnt to call out exactly what's needed; even when they advertise a DS role they specify what's required. For example, this page on Coursera[1] is clear about the emphasis on quant skills, which is a welcome development IMO.

[1] https://www.coursera.org/articles/what-is-a-data-scientist


Yes, unfortunately when it was declared the Sexiest Job, there was a tremendous influx of bootcamps promising a six-figure income after 3 months of part-time study. That has certainly lowered the overall quality of the Data Scientist title.


One funny aspect of the changing definition of "data scientist" is that I, currently a data scientist, spend most of my professional day in the LLM/AI modeling areas nowadays, building custom models instead of analyses and dashboards, since the former is more impactful.

Job postings still want the latter, though. If I ever left my job I'm not confident I could get another job with the Data Scientist title, nor could I get an "ML Engineer" job, since those focus more on deployment than development.

My R is embarrassingly rusty nowadays and I miss making pretty charts with ggplot2.


No, that's an MLE. A DS rarely gets asked leetcode algo questions; an MLE would.


I literally was asked two leetcode questions verbatim when interviewing for a data science position at TikTok a few months ago. Dynamic programming (I won't mention which question) and then one regarding binary trees.


> (I won't mention which question)

You must protect the corporate overlords.


Alternatively, protect themselves since giving away an individualized question could identify them.


The question does not matter at all.

All of the information is in the knowledge that it was Leetcode + the category.

(Does it really matter WHICH question?? They are different but all the same. That is the point.)


Depends on the company; a research-level MLE would be asked to derive loss functions and perform partial differentiation on pen and paper. You have to answer questions like what Kullbeck Leiber divergence is and how it can be utilized, etc.


Would that be a lesser known cousin of the better known Kullback–Leibler separation measure for distributions?


Full marks for snark, but points off for being incorrect.

https://en.m.wikipedia.org/wiki/Kullback%E2%80%93Leibler_div...


From your link:

    a measure of how one probability distribution P is different from a second,
i.e. literally it's a separation measure for distributions ... just as I recalled from my first encounter with the notion ~1984 (ish).

If you're sincere you should either add those points back or, preferably, expand upon your theory of how my snap take is incorrect.

(I'm aware it's not a metric, since it fails symmetry and the triangle inequality, etc.)


The snap take came across as an argument about which of two names for the measure is better-known.

The wikipedia page implies the opposite of that argument.

Perhaps that’s changed since 1984, but the proposition was about current practices.


It's been Kullback since birth in 1907 to the best of my knowledge, never once Kullbeck.

As a fully anglicized US citizen born in Brooklyn, New York I don't think there's ever been any vowel confusion over the spelling of the name:

https://en.m.wikipedia.org/wiki/Solomon_Kullback

Admittedly I did check as it's not uncommon for mathematicians to have alternate spellings for their names.

Ditto Leibler, born in Chicago, Illinois in 1914, no dropped L:

https://en.m.wikipedia.org/wiki/Richard_Leibler


You must have worked at different places from me. Nearly every DS job I had (before wisely, apparently, leaving that area) had leetcode-style algo questions during the interviews.

Again, things have apparently changed.


>Data Scientists, back in the day, were largely people with both a fairly strong quantitative background and a strong software engineering background. The kind of people who could build a demo LSTM in an afternoon.

As a field of "science" perhaps.

In real life (when it became hot), "data scientist" mostly meant "devs doing analytics", and a lot of it involved R and Python, or the term "big data" thrown around for 10GB of logs, and things like Cassandra, with or without some background in math or statistics.

What it never has been, in practice, was a combination of strong math/statistics AND strong software engineering background. 99.9999% of the time it's one or the other.


In the Enterprise, the best qualified person for any specialisation you may want has always been whoever IBM/Oracle/Tata has sitting on their bench that week.

They have the courses and certifications to prove it, too. It's magic!


i think you may be overestimating the avg. past data scientist's software engineering chops, but it's definitely true that the term has become more diluted than ever

you still find these kinds of people and roles at smaller companies but at largecorps, what's the point? the interesting modelbuilding you shunt off to your army of phd-holding research scientists. deploying models and managing infra goes to MLE. what's left is the data analyst stuff, which you repackage as "data science" because cmon, "analytics"? are we dinosaurs? this is modern tech, we have an image to uphold!

there's not really a need for, or supply of, people who can do everything (edit: _at largecorps_, obviously)


>There's not really a need for

Oh sure, if you have teams of research scientists and machine learning engineers to shunt the work to. That's, like, what? 5% of companies out there? Less?

No need, indeed.


so why exactly did you skip the first sentence of the paragraph so that you could make a self-evident point?

anyway that 5% hires a disproportionately larger # of "data scientists"


Sadly, in practice a data scientist has always been a person who can present data which supports what their boss expects.

edit: title has been updated: https://github.com/gvwilson/sql-tutorial/commit/14d1e57b94a8...


I use the following table:

                   | domain knowledge | quantitative knowledge | technical knowledge |
    data analyst   | high             | mid                    | low                 |
    data engineer  | low              | mid                    | high                |
    data scientist | mid              | high                   | mid                 |


Is quantitative knowledge "knowing stats"?


I really wonder whose fault it is. Unfortunately, what I see most is companies expecting you to be a jack of all trades (you should have GenAI/LLM skills, ML, data engineering, and what not).


As the name implies, a data scientist is a scientist who works on data. There is no requirement to be able to code an LSTM in one afternoon (and that would be absurd for most DS tasks).


I remember when it was a pejorative, literally.


I hate to be "that guy" but I find it a little bit sexist that the noob is called "Rachel". OK OK I'm gone.


Really excellent concise SQL guide, and great teaching by example. At most universities in the US this content gets taught over an entire semester (and probably shouldn't be). This guide is complete enough for 99% of industry SWE jobs outside of database optimization.

It's a great service when someone takes the time to document knowledge on a single page with quality examples, and trust the reader to follow along. Reminds me of the Rudin analysis book.


There is a W3Schools online tutorial that is excellent. I had all the new engineers do it before they started running queries at work. This guide is great too.


Exactly this, bookmarked.


Shameless Plug: If anyone here wants to practice their SQL, they are welcome to try my Mac app: TextQuery [1]. I built it because I wanted to quickly import CSV datasets and run SQL queries on them. I don't think there could be a more fun way to learn SQL than to jump in and start analyzing thousands of public datasets. Sure, you can use CLI/Code as well, but GUI is often faster and easier. Currently, the app is in the beta period and free-to-use. When launched, you'll get to keep the latest beta version.

[1]: https://textquery.app/


Why do you need to harvest email addresses?


Nice. I built some code that did this about 6 years ago, but it had no GUI and had to deal with some very irregular CSV files. I always wondered why stock SQLite didn't have better support for reading CSV directly and auto-detecting data types.
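
For what it's worth, the sqlite3 CLI can at least do the import part, though every column comes out as TEXT unless you create the table yourself first (a minimal sketch; penguins.csv stands in for whatever file you're loading):

  sqlite> .mode csv
  sqlite> .import penguins.csv penguins
  sqlite> select count(*) from penguins;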


This is a very interesting idea! Great work


Been meaning to learn SQL, so will check this out


Isn't this just a UI over SQLite?


There's clearly some non-SQLite functionality that seems really neat!


Super cool! Thanks for sharing.


This is brilliant.


As mentioned by others, the author Dr. Greg Wilson has written/compiled many books/tutorials which I can recommend. I would especially laud Software Design by Example [1][2], The Architecture of Open Source Applications [3] and Teaching Tech Together [4].

[1] https://third-bit.com/sdxpy/ (python version) [2] https://third-bit.com/sdxjs/ (js version) [3] https://aosabook.org/ [4] http://teachtogether.tech/


This is from Greg Wilson (https://third-bit.com), co-editor of "The Architecture of Open Source Applications" and many other superb references.

Edit: He wrote about this project here: https://third-bit.com/2024/02/03/sql-tutorial/


Before flights with patchy/no wifi, I often download a long, single-page tutorial. This is perfect. Curious if anyone knows of any for other languages/tech (e.g. beyond SQL).



Funny on that one, there is so much but not SQLite :-)



Wow - gave me a goofy smile.

Also cool 'Zig Zen': https://ziglang.org/documentation/master/#toc-Zen


Shameless Plug:

- Not single page _per se_ but I have plenty of Jupyter Notebook based tutorials here: https://github.com/DataForScience/

Self contained with slide decks and notebooks.


These are great, thanks!


Got any other favorite tutorials you've used for this kind of flight?


I made a couple of my own (they're incredibly scrappy, probably won't make sense to anyone but me):

- Ruby: https://github.com/stevecondylios/ruby-learning-resources/bl...

- Rails: https://github.com/stevecondylios/ruby-learning-resources/bl...

But this vim cheatsheet was great:

- https://vim.rtorr.com/

A failed attempt was to load (very) many (e.g. about 100) pages of javascript lessons from w3schools before the plane took off, but for some reason the pages tried to refresh during the flight and I lost them all, so that was a massive waste of time (opening them all before the flight took about 20 minutes).


Thanks! Bummer about the js course



each one of Peter Norvig's etudes at https://github.com/norvig/pytudes

from the sudoku solver http://norvig.com/sudoku.html to things like "NLP in Python" https://colab.research.google.com/github/norvig/pytudes/blob...

...

https://cryptopals.com/ (a remake of the Matasano crypto challenges) understand crypto by actually building and then subsequently breaking it (it's not strictly single-page, but wget can mirror it nicely)


Rust's tutorial isn't single page but it comes with the compiler.

Just run `rustup docs --book` after installing rustup


Don't forget the venerable SQLZoo - I have referred a zillion people to it over the years. https://www.sqlzoo.net/wiki/SQL_Tutorial

Edit: also an inaccuracy that's minor but can bite you if you're not careful: they mention temporary tables are in memory, not on disk. That's not true in almost all SQL databases; they are just connection-specific, e.g. they don't persist after you disconnect.

Some databases are optimized for a temp table to be a throwaway, but that can be a good or bad thing depending on the use case.
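
A minimal sketch of the actual behavior (table and column names made up):

    create temporary table recent_sightings as
    select * from sightings
    where seen_on > date('now', '-7 days');
    -- visible only to this connection and dropped automatically on disconnect;
    -- whether it lives in memory or on disk is up to the engine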


I am also a fan of Mode's SQL Tutorial: https://mode.com/sql-tutorial


Where are the data sources to practice them?


Each query has a built-in data source/test, but dbfiddle and friends are great if you want to do some setup.


The tutorial can be downloaded here: https://github.com/gvwilson/sql-tutorial/raw/main/sql-tutori...

(gives the penguins.db file necessary for the examples)


> full outer join (also called cross join) constructs their cross product

Full outer joins and cross joins are different types of joins. A cross join returns the Cartesian product of both tables, while a full outer join is like a combination of a left and right join.

Better explanation here: https://stackoverflow.com/questions/3228871/sql-server-what-...
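
A minimal sketch of the difference, with toy tables a(x) holding {1, 2} and b(y) holding {2, 3} (note that SQLite itself only gained FULL OUTER JOIN in 3.39):

    select * from a cross join b;
    -- 4 rows: every (x, y) pairing, no join condition at all

    select * from a full outer join b on a.x = b.y;
    -- 3 rows: the match (2, 2), plus (1, null) and (null, 3)
    -- padding out the rows that matched nothing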


Could have done with a visual representation of the starting tables and resulting tables to be a good description. It was 20 years ago that I was learning about Cartesian joins. When I do SQL I visualize an Excel-like structure in my head.


> left outer join

> A join that is guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with actual values if available or with null otherwise.

This wording only works for an identity-equality join condition. It creates a misleading mental model of left joins, and unfortunately it is very common.


I'm not sure I understand, I think this definition still works for left outer joins on conditions other than identity equality, since joins on, say, inequalities or multiple conditions would still be "guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with....".


It kind of implies that the left join process works by first taking the left table and then filling in the right table with a match if one exists, and otherwise null.

That model obviously doesn't work when there's more than one match, as the matching left row is duplicated for each match. However, I don't understand their point about this being a problem only without an "identity equality join condition", since it can also occur for equality joins as long as you're not joining on a unique key.


Can you elaborate with a gotcha example?


I assume they mean that row multiplication can occur but otherwise not sure.


I'd say it's misleading at best. It should specify that when there are multiple matches in the right table, the row from the left table is duplicated once per matching row.
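
A minimal sketch of that duplication, with made-up tables:

    -- orders: one row with id = 1
    -- items: two rows with order_id = 1
    select o.id, i.name
    from orders o
    left join items i on i.order_id = o.id;
    -- 2 rows, not 1: the single left row appears once per match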


Can you clarify? Is the problem here that the "guaranteed to keep all rows from the left table" piece is accurate, but the section about the right table might not be a good mental model if the join condition is more complex than a simple equality?


There was discussion about this exact same thing about 17 days ago. It comes up surprisingly frequently. https://news.ycombinator.com/item?id=39071550


Also StrataScratch.com for leetcode for SQL.

What really distinguishes an SQL master is working with queries hundreds of lines long, and query optimization. For example, you can often make queries faster by taking out joins and replacing them with window functions (sketched below). It's hard to practice these techniques outside of a legit enterprise dataset with billions of rows (maybe a good startup idea).
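
A sketch of that rewrite, reusing the tutorial's penguins table:

    -- join version: aggregate, then join back
    select p.species, p.body_mass_g, g.avg_mass
    from penguins p
    join (
        select species, avg(body_mass_g) as avg_mass
        from penguins
        group by species
    ) g on g.species = p.species;

    -- window version: the same result in one pass, no join
    select species, body_mass_g,
           avg(body_mass_g) over (partition by species) as avg_mass
    from penguins;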


one of the most performance-killing practices i’ve seen (outside of brutal orm-generated queries) is the sql hero who wants to do everything all at once in a single splat of sql. at a certain level of complexity, breaking things up into logical units and even forcing materialization (instead of hoping for some good intermediate materialization) can go a long way


A lot of these queries are only valid on SQLite (or maybe not "only", but certainly not generally available across DBMSs), including cases where the author doesn't point this out. For one example, filtering on aggregates is not supported by MySQL, MS SQL, Oracle, or most other DBMSs.


> For one example filtering on aggregates is not supported by MySQL, MS SQL, Oracle or most other DBMS's.

HAVING is one of the standard clauses; I use it on MySQL all the time and a quick search shows it exists in the others.
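
For contrast, a minimal HAVING example (reusing the tutorial's penguins table); HAVING filters whole groups after aggregation, which is the portable part:

    select species, count(*) as n
    from penguins
    group by species
    having count(*) > 30;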


I'm not talking about HAVING, I'm talking about item #25: filter aggregate inputs. Queries like

    select
        sex,
        round(
            avg(body_mass_g) filter (where body_mass_g < 4000.0),
            1
        ) as average_mass_g
    from penguins
    group by sex;
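
For what it's worth, there is a portable rewrite using CASE, relying on the fact that aggregates skip NULLs:

    select
        sex,
        round(
            avg(case when body_mass_g < 4000.0 then body_mass_g end),
            1
        ) as average_mass_g
    from penguins
    group by sex;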


Learning outcomes:

* Explain the difference between a database and a database manager.

* Write SQL to select, filter, sort, group, and aggregate data.

* Define tables and insert, update, and delete records.

* Describe different types of join and write queries that use them to combine data.

* Use windowing functions to operate on adjacent rows.

* Explain what transactions are and write queries that roll back when constraints are violated.

* Explain what triggers are and write SQL to create them.

* Manipulate JSON data using SQL.

* Interact with a database using Python directly, from a Jupyter notebook, and via an ORM.



I'm sorry but Clickhouse isn't SQL-compliant so it has no business teaching us JOIN.


No database is fully SQL-compliant, though ClickHouse can at times be painfully far from it.

https://stackoverflow.com/questions/784900/why-does-no-datab...


> PostgreSQL supports most of the major features of SQL:2023. Out of 177 mandatory features required for full Core conformance, PostgreSQL conforms to at least 170. In addition, there is a long list of supported optional features.

https://www.postgresql.org/docs/current/features.html

Clickhouse is simply a joke in comparison. No basic cursor support, no transactions, incomplete comparison operators, no single-row SELECT with GROUP BY and HAVING clauses, no grouped views, no procedures, no anti-joins or WHERE EXISTS, for that matter. The list goes on... it's basically impossible to write SQL to any degree of sophistication in Clickhouse.


> anti-joins

Does exist.

> single-row SELECT with GROUP BY and HAVING clauses grouped views

If it's about grouping sets they do exist.

> no procedures

parametrized views/UDF/executable UDF (UDTF) exist.

> WHERE EXISTS,

Exists, but without the correlated-subquery part, which is honestly no great loss given how subpar the performance of correlated subqueries usually is compared to the alternatives.

Can you remind me, does the SQL:2023 standard finally allow you to use the DISTINCT keyword in window functions?

Or are people still forced into horrible hacks with correlated subqueries or joins to do an extremely simple thing like a distinct count in a running window of X days?
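
The hack in question looks something like this (a sketch with made-up table and column names, assuming an integer day column):

    -- distinct users over a trailing 7-day window; count(distinct ...)
    -- is not allowed as a window function, so correlate a subquery instead
    select d.day,
           (select count(distinct e.user_id)
            from events e
            where e.day between d.day - 6 and d.day) as users_7d
    from (select distinct day from events) d;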

> it's basically impossible to write SQL to any degree of sophistication in Clickhouse.

That says more about the engineer than the DBMS.


One thing that I have noticed confuses a lot of people is "timeseries joins" (I don't know the real term for this). I'm talking about cases where there is no one-to-one match between keys in the two tables.

I'm a non-software type of engineer; in my world a lot of tables are structured as timeseries data (such as readings from a device or instrument), which uses the timestamp as a key.

Then we have other tables which log event or batch data (such as an alarm start and end time, or machine start/machine stop, etc.).

So a lot of queries end up being of the form

    select A.AlarmId, B.Reading, B.Timestamp
    from Alarms A, Readings B
    where B.Timestamp >= A.StartTime
      and B.Timestamp < A.EndTime

A lot of people seem to have problems grasping these kinds of joins.


Cool use case. They're just called "non-equi joins", because the join condition is an inequality. In general a join produces a row in the output table for each (left, right) pair of rows from the input tables that satisfies the join condition. It's just so common for joins to use a simple equality condition, where one or both sides is a unique id for its table, that people don't as often encounter joins where one input row can inform multiple output rows.


Duck and Clickhouse call this an AsOf join - https://duckdb.org/docs/guides/sql_features/asof_join.html
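
A sketch of the DuckDB flavor, reusing the alarms/readings example from upthread (the reading time renamed to ts for simplicity); instead of returning every reading in the interval, ASOF JOIN returns the single closest row satisfying the inequality:

    select A.AlarmId, R.Reading, R.ts
    from Alarms A
    asof join Readings R
      on A.StartTime >= R.ts;
    -- one row per alarm: the latest reading at or before the alarm start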



You mean slowly changing dimensions? It’s not something there is much literature on, especially for outer joins.


This does seem to be a fairly good curriculum for an introduction to SQL.


Does anyone remember some kind of adventure or mystery novel that you progress through by solving sql challenges?

I remember seeing it once and I can never find it now.


Select Star SQL? https://selectstarsql.com/

It opens up with analyzing death row inmates, so significantly more real than classifying flowers.


Sounds like https://mystery.knightlab.com/ SQL Murder Mystery by Northwestern University Knight Lab.


It's a different area of SWE, but https://deadlockempire.github.io/ is great except the final level.


There was one about space exploration based on Postgres, but I don’t remember the name.


Mastery with SQL. A paid course that's excellent.


Yeah it's excellent, but I am pretty sure OP was talking about something else:

https://sales.bigmachine.io/curious-moon


Damn, I meant Curious Moon and said the other great one.

People, do both! Worth every cent!


At least on 3.45.1, double quotes around string data no longer work:

  sqlite> delete from work where person = "tae";
  Parse error: no such column: tae
    delete from work where person = "tae";
                      error here ---^
  sqlite> delete from work where person = 'tae';
  sqlite>


I don't see how this is "for data scientists"; it looks like a good summary of SQL in general.


agreed. and sqlite specifically. i’d really like to see an authoritative resource that shows me how to use sql and relational algebra to do some basic and intermediate data analysis. i can find like 8 different ways to calculate skew and kurtosis, but is there a trusted resource that can show me how to do it? what other interesting data analysis can i do using sql?


According to the article itself it's explicitly not anyway - it's for instructors (of data scientists potentially, sure).


Also, remember ChatGPT can help you write queries like a data scientist without actually being one. Just tell it what you want in plain English. I've used this to get all sorts of useful metrics like conversion rates, messaging click-through rates, etc.


> ChatGPT can help you write queries like a data scientist

Badly then?

Personally I want my SQL queries to be written like a database professional.


Conversely, it’s very helpful at dissecting a 3k line insanity query and explaining it.


I have always found SQL to be one of the easiest languages to read, despite not being the easiest to write. Though I have never had to read 3k lines of SQL, maybe a page or two.


I totally agree, but some of these are T-SQL, which is less than easy to read IMO.


Yes. It'd be interesting to see what it makes of some of those ugly ORM generated queries.


> 3k line insanity query

*laughs in PTSD*


> do not expect novices with no prior SQL experience to be able to learn from them

Please suggest that entry-level thing, then.

SQL is enduring because it is logical and understandable. But there is still an initial vertical learning curve.

I am trying to teach a friend of mine SQL but I’m not sure how to construct a lesson plan.

Is there a canonical SQL 101?


Just above that it states

> [what this is] notes and working examples that instructors can use to perform a lesson

So use the resource to teach your friend.


But… she’s a novice. What am I missing?


Lots of great and very clear examples here. As someone who's been writing SQL for over 30 years, I wish I'd had something like this at the start.

Obviously there's a lot more to learn about many of these areas to really make the most of them, but this is a really good launchpad.


Thanks for sharing this! Learn by example resources can be super helpful.


There are some useful SQL queries for me in this list of 100, thanks!


Don't have time to do it myself, but it might be good to redo this guide with DuckDB (more likely to be used by DSes nowadays than SQLite).


Glad to have learned about iif() from this!


It's really handy in conjunction with GROUP BY.
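
A minimal sketch of the combination, reusing the tutorial's penguins table:

    select species,
           sum(iif(body_mass_g >= 4000.0, 1, 0)) as heavy,
           sum(iif(body_mass_g < 4000.0, 1, 0)) as light
    from penguins
    group by species;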


I would expect someone with the title of data scientist to at least be a master of SQL, the universal data query language.

But I'm so used to ORMs these days that anything more complex than a SQL join goes over my head unless I do a SQL refresher. As far as I have skimmed the article, it seems like a very good refresher even for a SWE. I will definitely put this tutorial on my todo list.


I’ve found that chatGPT is excellent in helping with generating and testing SQL queries. This tutorial is likely helpful for its intended audience; I’d like to see a bit more discussion of query optimization, stored procedures, table design, and available options both in db and in bindings, but it’s hard to fit everything in one place and keep it clean, which is why LLMs will eventually win.


I've found the same. Especially for sorta-complex JSON column migrations where the JSON structure needs to be changed. I can describe in plain English what I want to do and get a working query for it.


Item 70 doesn't look right.


What is a data scientist, if they are not a statistician?

I am a grumpy old man, fed up with newspeak


What's this newspeak? Why can't people just talk properly like we did before the war? Not that silly little war, the actual war, WW1.

I am a grumpy old man, fed up with all this newfangled nonsense.


Constantly evolving language and job titles. I consider myself a programmer. I market myself as a software engineer on LinkedIn. I am a generalist who uses mostly Python/Django. I like SQL and feel that the database should be leveraged to do the heavy lifting. But software engineers plug APIs together these days.


Fantastic summary; saving it for general tips when trying to show people SQL.


The diagrams at "check your understanding" sections are great.


Very useful guide to SQL, thanks!


The queries are solid, but I really appreciate their throwing in the world's most confusing diagrams here and there. It keeps me alert trying to find where to even start with them. So to be clear, tables must have a name, and must have rows, and the rows must have the same name as the table, yes?


Why bother, just use ChatGPT



Nice! It could be better if the queries were interactive.


That's what would change this from a lousy resource to a brilliant one.

Little SQLer ("Little Squealer")


I’m sure MIT Press is working the phone lines right now


Watch out, you're only referring to MALE and FEMALE sex, that might trigger a few people here


Sorry for feeding the trolls, but: I feel like I observe way more people complaining about people getting triggered by M/F sex than I actually see people complaining about it. That is why this person had to make up the triggered person: you will not find them in this comment section.




