> particularly multiple calls to the database for very simple join operations th...

overcast · on Nov 4, 2019

When you have millions or billions of rows, SQL calls absolutely become a huge bottleneck. I deal with this all the time from shortsighted developers at the office in other ORM environments. Most recent being a SELECT...NOT IN('a','b','c'), causing an INDEX SCAN on 200 million rows, and then complaining it takes 6 minutes to run.

Look at Django's select_related. It's one of those if you don't understand what's happening under the hood, then you're probably querying way more than you should be.

hombre_fatal · on Nov 4, 2019

I like having some sort of db:seed task during early development that seeds 1MM+ rows in all of your tables to help you experience performance issues as they're created.

It's even more important if your production application has millions of rows. Too easy to create a system that runs perfectly on 10 rows but will crash your production server as soon as you deploy it. Forgetting to create an index on the FKs is a classic one.

overcast · on Nov 4, 2019

Yes of course, but let's be realistic, would the people doing this type of clueless work, bother to seed a database? It's a self fulfilling prophecy of naivety. It's like all of the investing and saving advice on Yahoo Finance. The people reading it are already the ones doing it, because they are interested in learning more about it.

bshipp · on Nov 4, 2019

select_related caught me when I first started using Django. It's a sneaky one because the queries worked fine when I first wrote the program. Then I started populating the database and, over the course of months, the queries got slower and slower and slower.

Eventually I was forced to pop open the hood and horrified to find this spaghetti bowl of nested, duplicate queries that took a fair bit of work to simplify and optimize. I was not so lucky as to have a DBA I could dump my problems on and was forced to learn that lesson the hard way.

aldoushuxley001 · on Nov 4, 2019

What was it about select_related that slowed your queries? What'd you do to fix them? Did you have to abandon select_related, or just tweak its parameters?

I'm just building out a Django app now and using select_related, or rather prefetch_related, for retrieving tags (m2m relationship). Seems to work well so far, but I'm I'm sure I'll run into a similar thing of having to optimize all these queries soon.

bshipp · on Nov 4, 2019

Similar to what @overcast said: initially I didn't use select_related at all. Almost immediately I saw huge DB utilization with hundreds of thousands of tuples returned for (what I assumed) were pretty simple queries. I realized, as @overcast said, that it was looping instead of asking for it all at once, so I added indexes and appended "select_related" to almost every query. Then I figured it was fixed.

Once my database hit 100GB and a few hundred million rows I had no choice but to sit down and actually learn what each of my ORM commands was asking my database to do. Sometimes I removed a select_related. Sometimes I replaced it with prefetch_related. Sometimes I eliminated an entire filter operation or moved it elsewhere. A few times I injected a greatly simplified raw SQL query instead of relying on complex ORM generated SQL. In four instances I replaced expensive join operations with periodically rebuilt "materialized views" to reduce CPU usage and DB I/O. All was timed with django-debug-toolbar and/or pghero to minimize database impacts and network congestion.

So select_related was sneaky in the sense that I thought I had solved the problem very early on, when I had merely delayed it until much later. If your database always remains small you'll likely never encounter this issue.

The solution is to not fire and forget the application, but to install something like django-debug-toolbar and monitor what your program is doing as the database grows in size. But for heaven's sake, don't worry about that problem today. Get your app working so you can make money. Once it's done, however, remember that your ORM has put a thick collar on your new puppy, and as it grows you'll need to expand that collar or you'll slowly strangle your pet.

overcast · on Nov 4, 2019

selected_related is the solution, not the problem. Without it, Django would make a database query literally every single loop iteration. My point is that without paying attention, Django makes it very easy to query things, sub-optimally.

overcast · on Nov 4, 2019

PRECISELY my point. Everything is easy to setup, and works amazing, when nothing is actually in the database.

andybak · on Nov 8, 2019

> When you have millions or billions of rows, SQL calls absolutely become a huge bottleneck.

Not necessarily. It's the specifics of the queries not the number of rows.

Which is why you measure before you optimize.

bshipp · on Nov 4, 2019

django-debug-toolbar is excellent, and I use it on my django sites as well.

In addition, the documentation does a pretty good job of highlighting some of the common gotchas. Unlike your environment, my page/js weight is very low but I'm querying against a few hundred million records joined across many tables. Even using materialized views to eliminate the impact of joins in postgres, it's required a fairly delicate touch to make the delay for page loads tolerable.

In that respect I would likely redo the project in flask and sqlalchemy, if only because then I wouldn't have to remember the syntax nuances of two separate ORMs. They're similar, but not identical, and it's infuriating at times. Plus I'm very comfortable dipping down into raw SQL in sqlalchemy, and it hasn't been as intuitive for me with Django.