Regarding #5:
>That caught me by surprise. Both options “roughly the same” and “depends on the data” got about 25% — the guessing probably.
I don't think it was guessing so much as reasoning that fetching 100 rows (and filtering by value) instead of 10 rows doesn't have significant real-world impact unless the row data is particularly large. I'll admit I didn't think of the out-of-index lookup, but my main thought was 100/1000000 vs 10/1000000 isn't a big deal unless the DB is running on an abacus.
I also answered "about the same", and no, I didn't notice the bookmark lookup (so called on MSSQL). Even if I had, I'm not sure I would have changed my answer -- well, maybe I would have, because I would have noticed the "trick" -- but putting my test-taking adaptations aside....
It is already a highly selective query. Adding 100 bookmark lookups will not cause a material change in performance, unless this query is being executed thousands of times per second, in which case maybe your real problem lies elsewhere.
At least on MSSQL, I would expect a query plan like so:
1. Index seek WHERE a = 123, yielding ~100 rows. [1]
2. Bookmark lookup with results from (1), yielding ~100 rows.
3. Filter (2) WHERE b = 42 and project date_column, yielding ~10 rows.
4. Aggregate (3) by date_column, yielding ~10 rows or less.
And the optimizer will choose this over a full table scan so long as the cost of 1+2+3 is less than a full table scan. I don't know the threshold for that, but it is certainly more than 100 rows out of 1M+, and the planner will have an estimate of selectivity that will inform plan selection.
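For concreteness, a sketch of the kind of setup that plan assumes (the table definition and index here are my guess at the quiz's schema, not something the quiz spells out):
-- hypothetical schema; a and date_column are in the index, b is not
CREATE TABLE tbl (
    a           int,
    b           int,
    date_column date,
    payload     varchar(200)  -- stands in for the rest of the row
);
CREATE INDEX ix_tbl_a_date ON tbl (a, date_column);
-- steps 1-2: seek ix_tbl_a_date for a = 123, then look up each of the ~100 rows
-- steps 3-4: filter those rows on b = 42, then group/aggregate by date_column
SELECT date_column, COUNT(*)
FROM tbl
WHERE a = 123
  AND b = 42
GROUP BY date_column;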
[1] Important caveat. I interpret this line,
Current situation, selecting about hundred rows out of a million:
....to mean selection of 100 rows from the base table, rather than projection of 100 rows in the result, post-aggregation.
But if he really means a SELECT statement that returns 100 rows, then we have no idea how selective the WHERE clause is, and my answer changes to "The query will be much slower (impact >10%)".
No query optimizer would look at this and say "1M rows? Let's group and aggregate before filtering." Not to mention, the question specifically states that a=? would return 100 rows and a=? and b=? would return 10.
Regarding O(1), the first query would be some form of O(n log n) or O(log n) depending on the table/index data structures.
No query optimizer would look at this and say "1M rows? Let's group and aggregate before filtering."
I hope no optimizer would say that. It is well defined that filtering, as expressed in the where clause (if it uses indexes or not) happens before group and aggregate, and that grouping happens on the result of the filtering. If the optimizer could choose one way or the other, you'd have different results. If you want to group and aggregate first, you need to explicitly express that with a subquery.
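For example, a sketch along these lines (quiz column names, @a as a placeholder; the outer filter is on the aggregate purely to illustrate the ordering) groups first and only then filters the grouped result:
SELECT date_column, cnt
FROM (
    SELECT date_column, COUNT(*) AS cnt
    FROM tbl
    WHERE a = @a
    GROUP BY date_column
) AS grouped
WHERE cnt > 5;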
> The worst case for the latter query is that it must aggregate over 999,910 rows.
No, it already said it only took 100 rows, it can't get worse from that.
Now if he actually meant the final result set was 100 rows (meaning after the group by) that's different. But that's not what he actually said, so the question is misleading.
My thought was akin to what he said: the system needs to grab all 100 rows to do the second filter.
I didn't think about the fact that the original one was an index-only scan, so went with "Roughly the same", since outside of that property the performance is similar.
Too bad there is no way to get good data on why people thought that way without a much longer quiz.
It has a significant impact because the first case is "take 100 rows from the index" and the second is "take 100 rows from the index _and_, for each row, go to the row in the table (random IO, and with 1M rows it is probably 1 IO per row) to check the value of 'b'".
Such 100 random IOs will cost 0.5 sec on 1 iron platter HDD for example. So the query performance will degrade significantly until either the table is already preloaded into memory or you use SSD drives.
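(Back-of-the-envelope, assuming roughly 5 ms of seek plus rotational latency per random read on a 7200 rpm disk: 100 lookups x ~5 ms = ~0.5 sec.)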
> Such 100 random IOs will cost 0.5 sec on 1 iron platter HDD for example.
That's an incredibly, incredibly iffy and mostly wrong statement, which depends on what is arguably a corner case that doesn't often reflect reality (factors include which DBMS, row ordering, table size, cache size, block size, page size, RAM size, hard disk seek time, and HDD throughput).
The only case where that's likely is performing a very cold query on a very large randomly distributed table once (and probably only once).
Even a table of 1 million rows at ~30 B per row could easily be read into memory in about 300 ms at ~100 MB/s sequential throughput (roughly time_ms ~= 5 + 1e6*rowsize/100e3, i.e. ~5 ms of seek plus ~300 ms to read ~30 MB).
>> Such 100 random IOs will cost 0.5 sec on 1 iron platter HDD for example.
>That's an incredibly, incredibly iffy and mostly wrong statement, which depends on what is arguably a corner case that doesn't often reflect reality (factors include which DBMS, row ordering, table size, cache size, block size, page size, RAM size, hard disk seek time, and HDD throughput).
You're welcome to specify iron-platter HDDs that would do 100 random IOs in a significantly different time than 0.5 sec.
>Even a table of 1 million rows at ~30 B per row could easily be read into memory in about 300 ms at ~100 MB/s sequential throughput (roughly time_ms ~= 5 + 1e6*rowsize/100e3, i.e. ~5 ms of seek plus ~300 ms to read ~30 MB).
>Query Optimizers do exactly this.
Not always. If it has enough info, it may take the full-table-scan path in such a case. Still, it will be about the same time: 300 ms vs. the 0.5 sec I mentioned.
Of course it is for cold queries. You forgot to mention, I guess, that the full table scan you propose may bypass the DB cache (the default depends on the DB; the modern tendency is to bypass by default), and thus a second run will take the same time, while table blocks brought in by random IO would frequently be kept in the DB cache, making a second run somewhat faster, depending on how much data was brought in.
I had the same thought. Seems like the optimizer could still perform an index-only scan to get to 100 rows, then go to the table to filter them down to 10 rows. Yes, the second step is extra, but should still be fast. What am I missing?
That was my reasoning when I answered, but I had missed the fact it was a GROUP BY, which means you can't just filter after the fact.
Edit: In other words, it was 100 or 10 aggregated rows. An extra WHERE clause will change the values of each of those rows rather than just filter the rows from 100 down to 10 (which is what a HAVING clause would do).
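Roughly the difference, sketched with the quiz's column names (the HAVING condition is only an illustrative placeholder):
-- WHERE filters the base rows before grouping, so the counts themselves change
SELECT date_column, COUNT(*)
FROM tbl
WHERE a = @a
  AND b = @b
GROUP BY date_column;

-- HAVING filters the already-aggregated groups: same counts, fewer groups kept
SELECT date_column, COUNT(*)
FROM tbl
WHERE a = @a
GROUP BY date_column
HAVING COUNT(*) >= 10;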
It's much simpler than that. The first query only has to reference the index because the data is IN the index. The second query has to access the table. That's it.
But if it wasn't for the GROUP BY, filtering 100 results of a million down to 10 results wouldn't change performance much even if you read every column of every row of those 100.
The trick is that the GROUP BY means "it used to return 100 rows, now it returns 10" is a red herring; it still has to read every underlying row to make up those 10.
SELECT date_column, count(*)
FROM tbl
WHERE a = @a
AND b = @b
GROUP BY date_column;
The "AND b = @b" causes the sql engine to access data in the table instead of solely relying on the index. GROUP BY has 0 to do with it. If you changed the query to
SELECT a, date_column
FROM tbl
WHERE a = @a
and
SELECT a, date_column
FROM tbl
WHERE a = @a
AND b = @b
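(For what it's worth, a covering index that also contains b, sketched below with the quiz's column names, would let even the two-column query be answered from the index alone.)
CREATE INDEX ix_tbl_a_b_date ON tbl (a, b, date_column);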
SELECT a, date_column
FROM tbl
WHERE a = @a
AND b = @b
Will only have to scan column b over 100 rows.
Even without an index that will always be negligible: not compared to using the index to grab 100 rows from 10 million, but just compared to running a query and returning results at all.
The reason the original can be a lot slower is that the 100 and 10 result rows are made up of a lot more underlying rows of actual information, because of the grouping.
You're right that:
SELECT a, date_column
FROM tbl
WHERE a = @a
AND b = @b
would be a lot slower, given the same data, but that isn't the scenario; the GROUP BY has implications about what "returns 100 rows, returns 10 rows" actually means in terms of data read.
Query 1 is an index seek only. It does not access the table data.
Query 2 will perform the same index seek but will need to do a key lookup on each row and filter.
It's not negligible. The 100 results are not made up of a lot more information in this case, regardless of the grouping, because the first query does not access the table.
Edit:
I happen to have a table laying around with a little over a million rows and set up a similar set of queries.
The query optimizer put the index seek at 6% of the total operation and the key lookup at the other 94%. The rest was negligible.
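A rough way to see the same split on SQL Server, assuming an index on (a, date_column) like the quiz's (table name and literal values are placeholders):
SET STATISTICS IO ON;
-- index seek only: everything the query needs is in the index
SELECT date_column, COUNT(*) FROM tbl WHERE a = 123 GROUP BY date_column;
-- index seek plus a key lookup per row to fetch b, then the filter
SELECT date_column, COUNT(*) FROM tbl WHERE a = 123 AND b = 42 GROUP BY date_column;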
I consider the critical path to SQL performance to be understanding what the data looks like and the type of queries to be executed against it rather than general guidelines.
To be frank, knowing the difference in how to write an inner and outer join when given a three table schema and a desired output is a frightening filter of candidates. Having an instinct that some type of index could help a query probably makes you a db wizard.
knowing the difference in how to write an inner and outer join when given a three table schema and a desired output is a frightening filter of candidates
Even just asking for a basic understanding at the "use this to look up that" level -- since SQL syntax is easy to learn if you know what you want, and whoever does the recruiting has taken to mostly finding us new college graduates -- filters out an absurd number of candidates.
I guess it's the same thing as makes FizzBuzz a useful question.
I honestly don't grasp why this is a good question.
Sure, I write a SQL query now and again, but if I have to join I look it up. I don't do it a lot. If I had to do it a lot for you, I would learn it inside out.
If I told you I was a DB wizard, then sure, I better already know how to join. But I know tons of people that don't really do SQL. They are eminently hire-able and valuable.
source: a guy (me) who was bounced out of an interview because he didn't recall some specific detail about boost::shared_ptr.
If you don't know how to do a JOIN you really don't know how to use SQL, and that's not a tricky use of JOINs at all. If you're hiring for a position that mostly uses an ORM or something and raw SQL is only important in cases where the ORM is giving bad performance, maybe not knowing how to do a three-table join is fine. If actually writing SQL is part of the job regularly, though, the sort of thing grandparent is talking about is absolutely a bare-minimum of knowledge needed, maybe even below the bare minimum.
To clarify, two SQL questions were asked during an interview round for a job description that specifically required the ability to write ANSI SQL against databases for reporting/analytics purposes. Opposed to bringing candidates in for a job description requiring Java experience and then dropping a couple of SQL "gotcha" questions on a surprised candidate.
I think at this point, most HN'ers would accept that a work sample test (of some limited scope and effort) is a better test than any interview question or whiteboard coding process. From that perspective, no question is really a good one. Unfortunately, at my workplace, there are strict rules in place specifically prohibiting this type of applicant process.
I tried to think of an analogy of someone dragging CSS into this conversation, but I couldn't. Maybe like throwing a dead cat into a room full of aristocrats?
The difference is probably .1us. You may find it interesting why browsers match CSS selectors from right to left (though this is a useless bit of trivia). see:
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." --Donald Knuth
Except using indexes is a major part of developing for a database. In many simple schemas, performance will completely break down without proper indexing. That's a bit different than spending time investigating if for-each is faster than a traditional loop.
My exception was to the idea of this being a premature optimization (sorry, I probably should've replied directly to your comment with the CSS example)
I am aware of that quote. And that is precisely my point, he is talking about premature optimization, not optimization. What could possibly make you think that the questions in this quiz are examples of premature optimization?
As much as we would love to have a magical language that knows exactly how to take anything and make it performant enough, there is a decent amount of evidence to say that is impossible.
Honestly though the nice thing about SQL is you can quickly get something that works, and then use tracing tools to figure out what is taking longer than it should.
Think of the first question. If you're going to use a function to transform data, you're going to take a performance hit vs. just doing an indexable operation against the data.
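A sketch of the kind of thing meant here (column and index names assumed; the expression index in the last line only applies on databases that support it, e.g. PostgreSQL or Oracle):
-- wrapping the indexed column in a function usually defeats a plain index on it
SELECT * FROM tbl WHERE UPPER(last_name) = 'SMITH';
-- comparing the bare column lets an index on last_name be used
SELECT * FROM tbl WHERE last_name = 'Smith';
-- or index the expression itself, where the database supports it
CREATE INDEX ix_tbl_last_name_upper ON tbl (UPPER(last_name));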
If you want a facility that will magically figure out what you want when you are unable to express what you want, good luck. That's unicorn territory.
"On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question." ~ Charles Babbage
SQL is a query language and an RDBMS is an implementation.
It's like calculus: you can compute efficiently or inefficiently, and that has nothing to do with calculus itself.
The same developers that write terrible SQL queries make god-awfully slow programs with their pointer-chasing mess, because they don't understand that their programs run on actual, real hardware and that a cache miss is pretty much the single most expensive operation a modern CPU can make, by 2 to 3 orders of magnitude.
They just go on and on about interfaces, inline methods, virtual methods, etc., and never check whether their fancy-schmancy object fits on a cache line. It's the same with databases: they just talk about 'webscale' and 'big data', abandon SQL, and then wonder why their DB explodes once it reaches 100 GB.
SQL is a bit deceptive in that it cloaks set-based operations in something apparently simple. Iterative operations are easier for most people to conceptualize—you could create a flow chart, whereas sets require a bit of math. Once you can think in terms of sets, SQL becomes a lot clearer. The issue with sets is, in my opinion, the root of problems in Object-Relational Mapping.
There were other query languages that were less of a messy compromise. QUEL, for example, was based more rigorously on relational calculus. I think it never took off because the dumbed-down SQL was a little easier for programmers to conceptualize initially.
Hmmm, I'm no SQL jockey, rarely deal with it directly, and currently work with MongoDB. I got 5-of-5 on the PostgreSQL flavor of the test but am sure I'd fail a whiteboard interview on SQL particulars. Being a generalist I can usually figure things out and think I have an intuitive grasp of issues involved in many systems -- but unfortunately only 30% of workplaces understand that intuition, adaptability, and general experience usually trump specific knowledge.
Fortunately I've landed pretty good jobs in spite of employer myopia, and work for a SV salary and a SV company though I live in a non-tech town in North Carolina. Just venting a frustration.
Figuring things out is great, it's how we all learn, and it's an essential skill. But if you lack curiosity and/or respect for the specifics of the technologies you're deploying, the systems you build by intuition alone will ultimately fail.
Yes, but I suspect his point was "I work with NoSQL, so testing me on my SQL knowledge isn't a very good proxy for my ability to tune a database". If he didn't know the details of MongoDB, and that was in the hot path (let's face it, for many situations it just doesn't matter), then that would be worrying.
A guess, based on my experience and accounts of others here and on other similar forums. The figure could be much different for smaller startups, I've usually joined teams of between 3 and 15 engineers, where the adaptability and learn-as-you-go are more valuable.
Interesting: every question in this quiz is about the concept of covering indexes, and figuring out whether the query in question is covered by the suggested index or not.
I'm surprised people didn't score better on this, it's a very simple concept. :-/
The author has a website and sells a book devoted to teaching people about SQL indexing, so that's where his focus is. I wonder what kind of a pool he got; given that he seems to have given it mostly to people reading his site, you'd think that it'd be people who know more about SQL indexing than the average.
Most troubling to me was how people who chose the MySQL option did so much worse than pretty much every other database. (I took the MySQL option, even though I work on MS SQL these days, and got 4 out of 5.) I suppose for people who consider MySQL to be a toy RDBMS, that's less "troubling" than "confirming current perception."
I agree. I scored 5 out of 5 (Oracle flavour), and didn't have to think too hard about any of the questions. The review article specifically mentions that he has no data about the kinds of people answering the test though. A lot of people just don't do raw SQL that often.
He wonders about PostgreSQL users doing poorly on question 4. This is a fairly tricky question. Since the index is a btree index, it is pretty clear what the right answer is, but there are ways to address the wildcard issue with better indexes. PostgreSQL folks who have less familiarity with the db may miss the fact that a GIN or GiST index, together with the pg_trgm extension, is required to make that query fast.
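For reference, the PostgreSQL approach looks roughly like this (column name assumed):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ix_tbl_name_trgm ON tbl USING gin (name gin_trgm_ops);
-- with the trigram index in place, even a leading-wildcard LIKE can use it
SELECT * FROM tbl WHERE name LIKE '%foo%';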
I have two nearly-identical tables, both shaped like (foreign_key some_data_type, name varchar2, value varchar2). (Before you say "that means you should use No-SQL", this is a staging environment for loading data into a claims handling system. Which is built in a language that comes with a relational db built in.)
They both have about 50M rows, with statistically identical data. I was running near-identical large queries on both, with the same execution plan (nested-loop join with index lookups), and getting vastly different timings.
This being a staging environment, these tables are re-populated by truncating them and running an ETL tool. What turned out to be happening is that in one case the source query on the ETL tool was sorted by the foreign key, and in the other it wasn't. So in one case all those fetch-by-index-lookup operations added up to essentially a partial table scan, and in the other they added up to what you'd expect when blocks are fetched in random order and probably re-fetched after falling out of cache.
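In other words, roughly the difference between an extract query like the one below (names are placeholders) and the same query without the ORDER BY:
-- sorted extract: rows land in the staging table in foreign_key order, so later
-- index lookups by foreign_key hit neighbouring blocks instead of random ones
SELECT foreign_key, name, value
FROM source_table
ORDER BY foreign_key;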
Interestingly, if you worked a lot with NoSQL systems, you could get 5/5 on that score.
Although I worked a lot with SQL in past, I think good knowledge of only a NoSQL system (e.g. GAE Datastore) would give you enough understanding of how simple indexes are, and how to deal with them to pass this pure SQL test.
I'd like to hear feedback from someone with only NoSQL experience after taking that test.
Maybe I'm just a bit slow, but what's going on with the 50/50 "guessing score" reasoning? Surely you should halve the wrong answers as well, rather than just subtracting it from the right answers?!?
If a proportion X (between 0 and 1) of the population knows the right answer (call it A, and the wrong answer B), and the rest guess evenly, you would expect:
A = X + (1 - X) / 2
So to calculate X, given A,
2A = 2X + 1 - X = X + 1
X = 2A - 1
The author implied the formula
X = A - 0.5
which is not correct. In particular, assume everyone got the answer right (X=A=1). Then the assertion that half of the answers are correct guesses is absurd.
The correct lucky guess fraction (amount of the right answers to discard) is:
A - X = A - (2A - 1) = 1 - A
If A is near 0.5 (which we would expect if our model is accurate and few people know the answer), the 50% approximation the author used is about right.
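To make it concrete with made-up numbers: if 60% answered correctly (A = 0.6), then X = 2(0.6) - 1 = 0.2 actually knew the answer and 1 - A = 0.4 of respondents were lucky guessers, whereas the author's X = A - 0.5 would put the knowers at only 0.1.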
There are two choices. If everyone guessed, the expected value is 50% correct and 50% incorrect. This is why it's 25/75 for the question with four choices.
I like some of the examples they provide; however, this is very platform-dependent. Some advanced platforms have highly optimized query engines, to the point where even bad queries can be handled if they are run many times. Expressions can be reordered without the user's knowledge and the results will be the same.
SQL is extremely powerful. Whenever something comes up and says that it's going to hide the complexity of SQL/raw DB access, it's lying. :) It can work well in many cases, then it bites you in the ass later. That's not a reason not to use ORMs, for example, but without knowing how the underlying DB works you can have problems. Abstractions are leaky, always. The same applies if you work in a high-level language and you don't care/know how caches work, or if you use HTTP requests and have no idea how HTTP works, and so on.
Indeed. I would actually go further and say it can be a reason not to use ORMs though. But the rest of this is to bolster your point and hopefully give you some more ammo for your case.
One of the reasons I started working on the PGObject Perl framework[1] was because I found that if I hand-coded the SQL and programmed around that, I was more productive, and had fewer bugs.
The approach taken (based on our experience with LedgerSMB) is to focus on rules for mapping object oriented calls to stored procedure calls. The framework is of course PostgreSQL-only (and relies a lot on system catalog lookups).
My experience, backing what you are saying, is that if you know what you are doing, and you write good, maintainable stored procedures/UDF's then the power is totally worth it. Unfortunately, there are still some obstacles, and most of these are tooling.
I think the hatred of SQL from the parent (in the context of this quiz) is that if you separate performance from result, as SQL does, then you have to think about performance differently. In theory you code for results first and then tune performance later. In practice, this only gets you so far and you need to code with some knowledge of performance limitations. I know I have been bitten as badly as anyone else.
But, the tradeoff is that when I write more of my code into my queries, I get better performance, fewer bugs, and a significantly smaller codebase. The knowledge is worth it warts and all.
(When I have some really nice code samples, I will submit to HN)
Actually I think this highlights why it is important to understand and work with raw SQL when critical performance matters.
I'll let my ORM generate basic CRUD statements; the kind of statement which typically only involves PK lookups. But I find it better to write my own SQL statements for important queries, and of course let the ORM map the result set into objects. In fact, with enough experience, I've also found it _faster_ to just write a query in SQL versus learning yet another DSL or query builder API.
How do you think an ORM will help you avoid those performance issues? Unless the ORM isn't very feature-rich but then you've got other problems anyway.
SQL's syntax is ugly because it was designed in the 70s, when some people had quite different ideas about what a DSL should look like (hey, COBOL, you are guilty too!)
Oh I don't have any false illusions. I know ORMs fail most of the time when you need very specialized queries - and I know how to do them - but I still hate them.
I honestly love SQL...I consider it a noble language:) It is the only declarative language out there that is still used today, and there are some solid reasons for that.
However, I believe the biggest mistake made in the creation of SQL was trying to emulate natural language flow patterns. This makes it hard to format and edit, and when people optimize for writability they kill readability (such as the use of leading commas). The lack of any real standard indentation practice makes cooperative work frustrating. And the natural language flow actually confuses the writer/reader into thinking that the order of operations is linear, as opposed to the FROM->WHERE->GROUP BY->HAVING->SELECT->ORDER BY->LIMIT order. This makes for really annoying logical bugs, such as how filter conditions in the WHERE clause on a left-joined table can effectively turn your left join into an inner join.
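The left-join gotcha mentioned above, as a sketch (table and column names are made up):
-- the WHERE filter discards the NULL-extended rows,
-- effectively turning the LEFT JOIN into an INNER JOIN
SELECT c.id, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'shipped';

-- putting the filter in the join condition keeps customers without orders
SELECT c.id, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
  AND o.status = 'shipped';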
The biggest mistake in SQL was in using NULLs to mean three different things (no record found in an outer join, vs not applicable, vs not yet known). Fortunately with some types (varchar) on sane db's you can use dedicated empty values ('' for varchar) for not applicable, but you are still stuck with the other two as possibly conflicting.
This is actually a big issue because your ability to have complex declarative constraints goes up with table width, so breaking off potentially nullable columns also removes them from being available for cross-column check constraints.
Expressiveness? SQL is quite verbose for what it achieves, and if you look at a SQL statement with some JOINs, I'm not sure you would want to uphold that opinion of its expressiveness.
Don't get me wrong, I have no problem with SQL as such but it has some shortcomings. Probably most things can be blamed on its age.
Why is the syntax of INSERT and UPDATE statements unnecessarily distinct? They do almost the same thing. Why are the values and column names separated in the INSERT statement's syntax, but not in an UPDATE statement?
Why are there keywords that consist of more than one word? (Maybe the same misguided idea as with COBOL, that being able to "read" SQL would make it easier to write/maintain?)
The syntax of SQL is in many places not tight enough, so you often have to hunt down a missing comma or typo, because the SQL parser is just not able to find out what you've done wrong.
So, yeah, but UPDATE can employ a WHERE clause to apply conditional logic, whereas INSERT, not so much. You insert a record into a table. As long as you know the table? Boom. Done. Oh, but now you'd like to change some records in the table? Well, that's great, but which ones? Tell me where?
I think the whole readable-natural-language premise is a neat idea, but much in the same way that natural language can be easily complicated, so too with SQL. Heap abstractions of logic and arithmetic on top, and it gets even worse. But it's still a cool idea. If it works out in your favor, you'll get self-documenting code. But alas, there is no floor or ceiling to the complexity one can introduce, be it deliberately or accidentally. And it's not only the query itself (or the author) that might overcomplicate matters. The data schema, defined by the DDL, can easily induce unwanted complexity in subtle ways, and this might be a problem deliberately placed beyond the reach of the person attempting to read or change the data locked within.
As for complaints about parsers, that's not a deficiency of the language; it's a deficiency of the tools used. If the development environment isn't presenting problems to the developer in a convenient manner, then the utility sucks. Get a new one. Don't blame the text-mode clients, though. They are just as bare-bones as a command line, and deliberately so. At a certain level, SQL can be regarded the same way a shell script might be, in terms of possessing the convenience (or lack thereof) of a basic TTY command interpreter. But there's no real reason it should be impossible to create a SQL script authoring tool that provides syntax validation, and communicates errors, as conveniently as any compiled language's tooling does.
38.2% of people got 4/5 or 5/5 questions right. That is amazing. I never would have guessed anywhere even close to that. I would have figured it would be around 15% or so given that guessing randomly would put it at 12.5%. I guess I get a skewed perspective from looking at applicants since presumably most of those 38% are happy with their current jobs.
I took the test twice so I'm guilty of skewing the stats. My reasoning - first time I got 4/5 and it told me 'you know a little bit about SQL performance tuning.' I was curious about what the 5/5 message would be. It's interesting on many levels that the message was the same.