Not to detract from the article, but: Wow the meaning of the term "data scientist" has changed since the days of "sexiest job". From the article description:
- Rachel has a master’s degree in cell biology and now works in a research hospital doing cell assays.
- She learned a bit of R in an undergrad biostatistics course and has been through the Carpentries lesson on the Unix shell.
- Rachel is thinking about becoming a data scientist and would like to understand how data is stored and managed.
Data Scientists, back in the day, were largely people with both a fairly strong quantitative background and a strong software engineering background. The kind of people who could build a demo LSTM in an afternoon. Usually there was a bit of a trade-off between the quant/software aspects (really mathly people might be worse coders, really strong coders might need to freshen up on a few areas of mathematics), but generally they were fairly strong in each area.
In many orgs it's been reduced to "overpaid data analysts", but I wouldn't even hire "Rachel" for a role like that.
The term has always gone in a half-dozen directions at once, and ranged anything from
* an idiot making PPT decks for business presentations based on sales data; to
* a statistician with a very sophisticated mathematical background but minimal programming skills, doing things in R or Stata; to
* a person with a random degree making random dashboards in Tableau; to
* a person with a sophisticated background in software engineering, data engineering, and related fields who can kind of do math; to
* an expert in machine learning (of various calibers)
* a physicist using their quantitative skills to munge data
... and so on. That's been confusing people since the title came out. It depends on the industry, and there's a dozen overlapping titles too, some with well-defined meanings and some varying from company to company (business analyst, data engineering, etc.).
This is so true. Outside of maybe FAANG companies, a lot of places have wildly different expectations for that role. While one company may refer to the guy doing simple PPTs as a business analyst, others might call that a data analyst or a data scientist or something else. The pay probably reflects the truth, though, outside of exceptions driven by office politics.
Relatedly - and I’m a lead data engineer at my current $JOB — I’ve yet to find a definition of what a data engineer is/does that I find easy to share with people. Of course I have flippant ones (YAML dev with a bit of Python) but nothing more than: Database Admins who learned Python and now care about more of the data lifecycle than the data that resides in the DBs they managed.
Not a data engineer per-se, but I do a lot of data stuff.
I find AI has revolutionized anything one-off, and data stuff has a lot of one-off. 80% of the time, I can ask the LLM and get a solution which would take 1-2 hours to code which works.
* Verification of correctness is unnecessary or nominal, since it only needs to work in one case. The complex corner cases that come up in software systems simply don't exist here. And in most cases, the code is simple enough that you can verify it at a glance.
* Code quality doesn't matter since it's throw-away code.
It's something along the lines of:
"I have:
[cut-and-paste some XML with a hairy nested JSON structure embedded]
I want:
[write three columns with the data format I want, e.g. 3 columns of CSV with the only the data I need]"
Can you make a Python script to do that?
[Cut-and-paste script, and see if it works]
If it does, I'm done. If it doesn't, I can ask again, break it into simpler steps, ask it to debug, or do it by hand. Almost no time lost up to this point, though, and 80% of the time, I just saved two hours.
In practice, this means I can do a lot more prototyping and preliminary analysis, so I get to better results. Deadlines, commitments, and working time have not changed, so the net result is much higher quality output, holistically.
I think I need to revisit my hesitation to using LLMs. I think it stems from stubbornness. I’d rather write the boilerplate and the code to do the transformation or do so in DuckDb in SQL but if the tool can do it well enough so be it.
The bit about one-offs doesn't match my experience. Connector code, extraction, and even data cleaning change based on the source, and that code usually ends up in production.
Ideal would be an endpoint to send data to, like your example with sample data, which after a prompt returns either the code needed or, bypassing the code entirely, just the subset of data I requested.
This discussion started with what a data engineer does, and the diversity of roles. I wasn't trying to push my workflow on anyone. With what I'm doing right now (which includes data and SWE; I'm now writing more holistically), there is a flow:
---
Step 1:
- I do a lot of exploratory and one-off analysis, some of which leads to internal memos and similar and some of which goes nowhere. I do a lot of prototyping too.
- I do a lot of whiteboarding with stakeholders. This is also open-ended and exploratory. I might have a hundred mock-ups before I build something which would go into prod (which isn't a lot of time; a mock-up might be 5 minutes, so a hundred represents a few days' time).
This helps make sure: (1) I have enough flexibility in my architecture to cover likely use-cases, and I don't overengineer for things which will never happen; and (2) I pick the right set of things to build.
---
Step 2:
I build high-fidelity versions of the above. These, I can review e.g. with focus groups, in 1:1s, and in meetings.
---
Step 3:
I build production-ready deployable code. Probably about a third of the things in step 2 reach step 3.
---
LLMs do relatively little for step 3. If I have time, I'll have GPT do a code review. It's sometimes helpful. It sounds like you spend most of your time here, so you might get less benefit than I do.
For step 2, they can often build my high-fidelity mockup for me, which is nice. What they can't do yet is do so in a way which is consistent with the rest of my codebase (front-end theming, code style, tools used, etc.). I'll get something working end-to-end quickly, but not necessarily something I can leverage directly for step 3.
However, in step 1, they've had a transformational impact. Exploratory work is 99% throw-away code (even the stuff which eventually makes it to prod; by that point, it has a clean rewrite).
One more change is that in step 1, I can try different libraries and tools too. LLMs are at the level of a very junior programmer, which is a lot better than me in a tool I've never used. Evaluating e.g. a library might be a couple of days of learning with the equivalent of 5 minutes - 1 day of building (usually, to figure out it's useless for my use-case). With an LLM, I have a feasible lousy first version in minutes. This means I can try a half-dozen libraries in an hour. That didn't fit into my timelines pre-LLM, and definitely does now. I end up using better libraries in my code, which leads to better architecture.
So YMMV.
I'm posting since I like reading stories like the above myself. Contexts vary, and it's helpful to see how things are done in contexts others than my own. If others have them, please feel free to share too.
ChatGPT, a plurality of the time. I go through the API with a little script I wrote. If that doesn't work, I step up to GPT-4. The costs are nominal (<$3/month). ChatGPT has gotten worse over time, so the need to escalate is more frequent; when it first came out, it was excellent.
API (rather than web) is more convenient and avoids a lot of privacy / data security issues. I wouldn't use it for highly secure things, but most of what I do is open-source, or just isn't that special.
I have analogous scripts to run various local LLMs, but with my setup, the init / cooldown time is long enough that it's easier to use a web API. Plus my GPU is often otherwise occupied. Most of what I use my GPU for are text (not code) tasks, and I find the open source models are good enough. I've heard worse things about them for code, but I haven't experimented enough to see if they'd be adequate. Some of that is getting used to how the system works, good / bad prompts, etc.
ollama + a second GPU + a running chat process would likely solve the problem for around $2k, so about the equivalent of a bit over a half-century of OpenAI API calls at my current spend. If I were dealing with something secure, that'd probably make sense. For what I'm doing now, it doesn't.
That's a good question. I think LLMs will have a place in the connector space. It would be really cool if they could dynamically handle changes in the source (the API changed and added some new data, new columns, etc.). But right now, I at least don't trust AI to do much of anything in terms of ingestion. When data is extracted from the source, it's got to be as close to a 1:1 copy of the source as possible. Any errors introduced will have a snowball effect down the line.
For data cleaning we do tend to write the same sorts of things over and over, and that's where I think things could improve. Though what makes a data engineer special, in my mind, is that they get to know the nuances of the data in detail. They get familiar with the columns and their meanings to the business, the expected volumes, and all sorts of things. And when you get that deeply involved with the data, you clearly see where things are jarring, and, like a vet with a sick animal, you write data-cleaning code because you care about the data that much.
Let's face it, job titles are a lot of bullshit. I was a "programmer". I call myself a "software engineer". I probably write better SQL than many data engineers / scientists, which is getting annoying as I am shoehorned into roles where I plug APIs together rather than deal with SQL. But data engineer roles always want a load of stuff I have never needed to deal with.
Data scientists with a strong software engineering background , where are they hiding?
Jokes apart, there used to be two categories of data scientists: those that came from a science/PhD background, who duct-taped their mathematical understanding to code that might work in production, and those that came from a CS background, who duct-taped their mathematical/Medium-tutorial knowledge to an extravaganza of grid search and micro-services that made unscientific predictions in a scalable way.
So now we have the ML engineer (engineering) and the data scientist (science) with clear roles and expectations. Both are full-time jobs; most people cannot do both.
but more seriously, unless someone's explicitly doing ML research for most applications using something off-the-shelf-ish[0] and tinkering with it works best. and this mostly requires direct experience[1] with the stack.
and sure, of course, if said project/team/org/corp has so much money they even can train their own model, sure, they can then afford to have these separate roles with "more dedicated" domain experts.
[0] from YOLO to LLaMa to whatever's now on HuggingFace
[1] the more direct the better. you have used LLMs before? great. pyTorch? great. you can deploy stuff on k8s and played with ChatGPT? well, okay, that's ... also great. you know how to get stuff from Snowflake/Databricks/SQL to some training job? take my money!
I've seen a disturbing rise in the number of people who think data engineering isn't software engineering. I don't plan to play up that part of my experience the next time I'm applying.
It's because data engineering has been reduced to being able to log in to a cloud provider and knowing which workflow to drag and drop. That can be learned in a couple of weeks, so that's why those skills might not be considered software engineering.
Well, that's the GP's point, I guess: this thing was called "Business analyst", and, honestly, I don't know what being a domain-expert with somewhat above-average computer skills has to do with "data science".
> people with both a fairly strong quantitative background and a strong software engineering background.
In my experience this intersection is practically a null set. It's an extremely rare feat to pull off, IMO; the mental bandwidth and time needed to be good at just one of those two would consume one person fully. This is why quant/stat specialists were paired with ETL/data-pipeline specialists to build end-to-end solutions.
One reason Data Science became such a hot role back in the day was that it was amorphously defined; because no one knew what exactly it entailed folks across a broad range of skill sets (stats, data engineers, NoSQL folks, visualisation and so on) jumped into the fray. But now companies have burnt their hands, they have learnt to call out exactly what's needed; even when they advertise for DS role they specify what's required of them. For example, this page on Coursera[1] is clear about emphasis on Quant, which is a welcome development IMO.
Yes, unfortunately when it was declared the Sexiest Job, there was a tremendous influx of bootcamps promising a six-figure income after 3 months of part-time study. That has certainly lowered the overall quality of the Data Scientist title.
One funny aspect about the changing definition of "data scientist" is that I, currently a data scientist, spend most of my professional day working with the LLM/AI modeling areas nowadays and building custom models instead of building analyses and dashboards, since the former is more impactful.
Job positions still want the latter, though. If I ever left my job I'm not confident I could get another job with the Data Scientist title, nor could I get a "ML Engineer" job since those focus more on deployment than development.
My R is embarrassingly rusty nowadays and I miss making pretty charts with ggplot2.
I literally was asked two leetcode questions verbatim when interviewing for a data science position at TikTok a few months ago. Dynamic programming (I won't mention which question) and then one regarding binary trees.
Depends on the company; a research-level MLE would be asked to derive loss functions and perform partial differentiation on pen and paper. You have to answer questions like what Kullback-Leibler divergence is and how it can be utilized, etc.
You must have worked at different places from me. Nearly every DS job I had before (wisely, apparently) leaving that area had leetcode-style algo questions during the interviews.
>Data Scientists, back in the day, were largely people with both a fairly strong quantitative background and a strong software engineering background. The kind of people who could build a demo LSTM in an afternoon.
As a field of "science" perhaps.
In real life (when it became hot) data scientists mostly meant "devs doing analytics" and a lot of it involved R and Python, or the term "big data" thrown around for 10GB logs, and things like Cassandra, with or without some background in math or statistics.
What it never has been, in practice, was a combination of strong math/statistics AND strong software engineering background. 99.9999% of the time it's one or the other.
In the Enterprise, the best qualified person for any specialisation you may want has always been whoever IBM/Oracle/Tata has sitting on their bench that week.
They have the courses and certifications to prove it, too. It's magic!
i think you may be overestimating the avg. past data scientist's software engineering chops, but it's definitely true that the term has become more diluted than ever
you still find these kinds of people and roles at smaller companies but at largecorps, what's the point? the interesting modelbuilding you shunt off to your army of phd-holding research scientists. deploying models and managing infra goes to MLE. what's left is the data analyst stuff, which you repackage as "data science" because cmon, "analytics"? are we dinosaurs? this is modern tech, we have an image to uphold!
there's not really a need for, or supply of, people who can do everything (edit: _at largecorps_, obviously)
Oh sure, if you have teams of research scientists and machine learning engineers to shunt the work to. That's, like, what? 5% of companies out there? Less?
I really wonder whose fault it is. Unfortunately, what I see most are companies expecting you to be a jack of all trades (you should have GenAI/LLM skills, ML, data engineering, and what not).
As the name implies, a data scientist is a scientist who works on data. There is no requirement to be able to code an LSTM in one afternoon (and that would be absurd for most DS tasks).
Really excellent concise SQL guide, and great teaching by example. At most universities in the US this content gets taught over an entire semester (and probably shouldn't be). This guide is complete enough for 99% of industry SWE jobs outside of database optimization.
It's a great service when someone takes the time to document knowledge on a single page with quality examples, and trust the reader to follow along. Reminds me of the Rudin analysis book.
There is a W3Schools online tutorial that is excellent. I had all the new engineers do it before they started running queries at work. This guide is great too.
Shameless Plug: If anyone here wants to practice their SQL, they are welcome to try my Mac app: TextQuery [1]. I built it because I wanted to quickly import CSV datasets and run SQL queries on them. I don't think there could be a more fun way to learn SQL than to jump in and start analyzing thousands of public datasets. Sure, you can use CLI/Code as well, but GUI is often faster and easier. Currently, the app is in the beta period and free-to-use. When launched, you'll get to keep the latest beta version.
Nice. I built some code that did this about 6 years ago, but it had no GUI and had to deal with some very irregular CSV files. I always wondered why SQLite didn't have better support for reading CSV directly and auto-ascertaining data types.
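The missing feature is small enough to sketch. Here is a minimal, naive version of CSV-to-SQLite loading with type inference; it assumes a header row and picks the narrowest of INTEGER/REAL/TEXT that fits every value in a column. Real-world irregular CSVs need much more care than this.

```python
import csv
import io
import sqlite3


def infer(values):
    """Return the narrowest SQLite type that accepts every value."""
    for cast, sql_type in ((int, "INTEGER"), (float, "REAL")):
        try:
            for v in values:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


def load_csv(conn, table, text):
    """Create `table` from CSV text (header row first) with inferred types."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    types = [infer([r[i] for r in data]) for i in range(len(header))]
    cols = ", ".join(f'"{h}" {t}' for h, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
```

SQLite's type affinity then converts the inserted strings to numbers for the INTEGER/REAL columns.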
As mentioned by others, the author Dr. Greg Wilson has written/compiled many books/tutorials which I can recommend. I would especially laud Software Design by Example [1][2], The Architecture of Open Source Applications [3] and Teaching Tech Together [4].
Before flights with patchy/no wifi, I often download a long, single-page tutorial. This is perfect. Curious if anyone knows of any for other languages/tech (e.g. beyond SQL).
A failed attempt was to load (very) many (e.g. about 100) pages of javascript lessons from w3schools before the plane took off, but for some reason the pages tried to refresh during the flight and I lost them all, so that was a massive waste of time (opening them all before the flight took about 20 minutes).
https://cryptopals.com/ (a remake of the Matasano crypto challenges) understand crypto by actually building and then subsequently breaking it (it's not strictly single-page, but wget can mirror it nicely)
Edit: there's also a minor inaccuracy that can bite you if you're not careful - they mention temporary tables are in memory, not on disk. That's not true in most SQL databases; temp tables are just connection-specific, e.g. they don't persist after you disconnect.
Some databases are optimized for a temp table to be a throwaway, but that can be a good or bad thing depending on the use case.
> full outer join (also called cross join) constructs their cross product
Full outer joins and cross joins are different types of joins. A cross join returns the Cartesian product of both tables, while a full outer join is like a combination of a left and right join.
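The difference is easy to see with a toy example. A minimal sqlite3 sketch: CROSS JOIN pairs every row with every row, while FULL OUTER JOIN (emulated here portably, since older SQLite versions lack it) keeps unmatched rows from both sides instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE l (id INTEGER);
CREATE TABLE r (id INTEGER);
INSERT INTO l VALUES (1), (2);
INSERT INTO r VALUES (2), (3);
""")

# Cartesian product: 2 x 2 = 4 rows.
cross = conn.execute("SELECT * FROM l CROSS JOIN r").fetchall()

# Full outer join, emulated: left join, plus right rows with no left match.
full = conn.execute("""
SELECT l.id, r.id FROM l LEFT JOIN r ON l.id = r.id
UNION ALL
SELECT l.id, r.id FROM r LEFT JOIN l ON l.id = r.id WHERE l.id IS NULL
""").fetchall()
# full has 3 rows: (1, NULL), (2, 2), (NULL, 3)
```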
It could have done with a visual representation of the starting tables and resulting tables to be a good description. It was 20 years ago that I was learning about Cartesian joins. When I do SQL, I visualize an Excel-like structure in my head.
> A join that is guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with actual values if available or with null otherwise.
This wording only works for an identity-equality join condition. It creates a misleading mental model of left joins, and unfortunately it is very common.
I'm not sure I understand, I think this definition still works for left outer joins on conditions other than identity equality, since joins on, say, inequalities or multiple conditions would still be "guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with....".
It kind of implies that the left join process works by first taking the left table and then filling in the right table with a match if one exists, and otherwise null.
That model obviously doesn't work, because if there's more than one match, the matching left row is duplicated for each match. However, I don't understand their point about this being a problem only when you don't have an "identity equality join condition", since this can also occur for equality joins as long as you're not joining on a unique key.
I'd say it's misleading at best. It should specify that for multiple matches in the right table, the row from the left table is duplicated per matching row in the right table.
Can you clarify? Is the problem here that the "guaranteed to keep all rows from the left table" piece is accurate, but the section about the right table might not be a good mental model if the join condition is more complex than a simple equality?
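The duplication behavior being discussed is easy to demonstrate. A minimal sqlite3 sketch (table and names invented for illustration): with two matches on the right, the left row appears in two output rows; with no match, it appears once with NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (name TEXT);
CREATE TABLE visit (name TEXT, day TEXT);
INSERT INTO person VALUES ('mik'), ('po');
INSERT INTO visit VALUES ('mik', 'mon'), ('mik', 'tue');
""")
rows = conn.execute(
    "SELECT p.name, v.day FROM person p LEFT JOIN visit v ON p.name = v.name"
).fetchall()
# Three rows: 'mik' twice (once per matching visit), 'po' once with NULL.
```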
What really distinguishes an SQL master is working with queries hundreds of lines long, and query optimization. For example, you can often make queries faster by taking out joins and replacing them with window functions. It's hard to practice these techniques outside of a legit enterprise dataset with billions of rows (maybe a good startup idea).
one of the most performance-killing practices i’ve seen (outside of brutal orm-generated queries) is the sql hero who wants to do everything all at once in a single splat of sql. at a certain level of complexity, breaking things up into logical units and even forcing materialization (instead of hoping for some good intermediate materialization) can go a long way
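One common instance of the join-to-window rewrite mentioned a couple of comments up: "latest reading per device", first as a self-join against a grouped subquery, then as a single pass with ROW_NUMBER (requires SQLite >= 3.25 for window-function support; table and values invented).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (device TEXT, ts INTEGER, value REAL);
INSERT INTO readings VALUES
  ('a', 1, 10.0), ('a', 2, 11.0), ('b', 1, 5.0), ('b', 3, 7.0);
""")

# Join version: match each row against its group's max timestamp.
join_q = """
SELECT r.device, r.value
FROM readings r
JOIN (SELECT device, MAX(ts) AS ts FROM readings GROUP BY device) m
  ON r.device = m.device AND r.ts = m.ts
"""

# Window version: one pass over the table, no self-join.
win_q = """
SELECT device, value FROM (
  SELECT device, value,
         ROW_NUMBER() OVER (PARTITION BY device ORDER BY ts DESC) AS rn
  FROM readings
) WHERE rn = 1
"""
# Both return the latest value per device: ('a', 11.0) and ('b', 7.0).
```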
A lot of these queries are only valid on SQLite (or maybe not "only" but certainly not generally available across DBMS's), including cases where the author doesn't point this out. For one example filtering on aggregates is not supported by MySQL, MS SQL, Oracle or most other DBMS's.
> PostgreSQL supports most of the major features of SQL:2023. Out of 177 mandatory features required for full Core conformance, PostgreSQL conforms to at least 170. In addition, there is a long list of supported optional features.
Clickhouse is simply a joke in comparison. No basic cursor support, no transactions, incomplete comparison operators, no single-row SELECT with GROUP BY and HAVING clauses, no grouped views, no procedures, no anti-joins or WHERE EXISTS, for that matter. The list goes on... it's basically impossible to write SQL to any degree of sophistication in Clickhouse.
EXISTS exists, but without the correlated-subquery part, which is honestly a joke given how subpar its performance usually is compared to the alternatives.
Can you remind me, does the SQL:2023 standard finally allow using the DISTINCT keyword in window functions?
Or are people still forced to do horrible hacks with correlated subqueries or joins to do an extremely simple thing like a distinct count over a running window of X days?
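For reference, this is the correlated-subquery hack in question, sketched in sqlite3 (which, like most engines, rejects COUNT(DISTINCT ...) OVER (...)): distinct users over a trailing 3-day window, recomputed per day by a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (day INTEGER, user TEXT);
INSERT INTO events VALUES (1,'a'), (1,'b'), (2,'a'), (3,'c'), (4,'a');
""")
rows = conn.execute("""
SELECT e.day,
       (SELECT COUNT(DISTINCT user) FROM events w
        WHERE w.day BETWEEN e.day - 2 AND e.day) AS distinct_users_3d
FROM (SELECT DISTINCT day FROM events) e
ORDER BY e.day
""").fetchall()
# -> [(1, 2), (2, 2), (3, 3), (4, 2)]
```

The subquery re-scans the window for every output row, which is exactly why this is a "horrible hack" at scale.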
> it's basically impossible to write SQL to any degree of sophistication in Clickhouse.
One thing that I have noticed confuses a lot of people is "timeseries joins" (I don't know the real term for this). I'm talking about cases where there is no one-to-one match between keys in the two tables.
I'm a non-software type of engineer; in my world a lot of tables are structured as timeseries data (such as readings from a device or instrument), which uses timestamp as a key.
Then we have other tables which log event or batch data (such as an alarm start and end time, or Machine start/machine stop etc).
So a lot of queries end up being of the form
Select A.AlarmId, B.Reading, B.Timestamp from Alarms A, Readings B where B.Timestamp >= A.StartTime and B.Timestamp < A.EndTime
A lot of people seem to have problems grasping these kinds of joins.
Cool use case. They're just called "non equi-joins" - because the join condition is an inequality. In general a join produces a row in the output table for each (left, right) pair of rows from the input tables that satisfies the join condition. It's just so common for joins to use a simple equality condition, where one or both sides is a unique id for its table, and people don't as often encounter joins where one input row can inform multiple output rows.
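A runnable sqlite3 sketch of the alarm/readings non-equi join above (sample values invented): each reading pairs with every alarm whose interval contains its timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Alarms (AlarmId INTEGER, StartTime INTEGER, EndTime INTEGER);
CREATE TABLE Readings (Timestamp INTEGER, Reading REAL);
INSERT INTO Alarms VALUES (1, 10, 20), (2, 30, 40);
INSERT INTO Readings VALUES (5, 1.0), (12, 2.0), (18, 3.0), (35, 4.0);
""")
rows = conn.execute("""
SELECT A.AlarmId, B.Reading, B.Timestamp
FROM Alarms A JOIN Readings B
  ON B.Timestamp >= A.StartTime AND B.Timestamp < A.EndTime
""").fetchall()
# Alarm 1 picks up readings at t=12 and t=18; alarm 2 picks up t=35;
# the reading at t=5 falls inside no alarm and produces no output row.
```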
At least on 3.45.1, double-quoted strings as data no longer work.
sqlite> delete from work where person = "tae";
Parse error: no such column: tae
delete from work where person = "tae";
error here ---^
sqlite> delete from work where person = 'tae';
sqlite>
agreed. and sqlite specifically. i’d really like to see an authoritative resource that shows me how to use sql and relational algebra to do some basic and intermediate data analysis. i can find like 8 different ways to calculate skew and kurtosis, but is there a trusted resource that can show me how to do it? what other interesting data analysis can i do using sql?
Also, remember ChatGPT can help you write queries like a data scientist, without actually being one. Just tell it what you want in plain english. I've used this to get all sorts of useful metrics like conversion rates, messaging click through rates, etc...
I have always found SQL to be one of the easiest languages to read, despite not being the easiest to write. Though I have never had to read 3k of SQL, maybe a page or two
I would expect someone with the title of data scientist to at least be a master of SQL, the universal data query language.
But I'm so used to ORMs these days that anything more complex than a SQL join goes over my head if I haven't done a SQL refresher. As far as I've skimmed the article, it seems like a very good refresher even for a SWE. I will definitely put this tutorial on my todo list.
I’ve found that chatGPT is excellent in helping with generating and testing SQL queries. This tutorial is likely helpful for its intended audience; I’d like to see a bit more discussion of query optimization, stored procedures, table design, and available options both in db and in bindings, but it’s hard to fit everything in one place and keep it clean, which is why LLMs will eventually win.
I've found the same. Especially dealing with sorta complex JSON column migrations where the JSON structure needs to be changed. Can describe in plain english what I want to do and get a working query for it.
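A hypothetical sketch of that kind of migration, done in Python rather than with DB-specific JSON functions. The table ("events"), column ("meta"), and key rename ("usr" to "user") are all made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, meta TEXT)")
conn.execute("""INSERT INTO events VALUES (1, '{"usr": "ann"}')""")

# Fetch everything first, then rewrite each JSON document and write it back.
for row_id, meta in conn.execute("SELECT id, meta FROM events").fetchall():
    doc = json.loads(meta)
    if "usr" in doc:
        doc["user"] = doc.pop("usr")  # the structural change we want
    conn.execute("UPDATE events SET meta = ? WHERE id = ?",
                 (json.dumps(doc), row_id))
```

Doing the rewrite in application code keeps it portable across databases, at the cost of a round trip per row.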
Constantly evolving language and job titles. I consider myself a programmer. I market myself as a software engineer in LinkedIn. I am a generalist who uses mostly python / Django. I like SQL and feel that the database should be leveraged to do the heavy lifting. But software engineers plug APIs together these days.
The queries are solid, but I really appreciate throwing in the worlds most confusing diagrams here and there. It keeps me alert trying to find where to even start with them. So to be clear, tables must have a name, and must have rows and the rows must have the same name as the table, yes?
Sorry for feeding the trolls, but: I feel like I observe way more people complaining about people getting triggered by M/F sex than I actually see people complain about it. That is why this person had to make up the triggered person, because you will not find them in this comment section.