
We give people a series of dropdowns on columns on tabular data, and they can do things like IN and BETWEEN queries for date and numeric data, as well as literal equality for string data.
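As a minimal sketch of what that filter-to-SQL mapping might look like, here is one way to render each dropdown filter as a parameterized SQL fragment and AND them together. The function and filter names here are invented for illustration, not the actual API:

```python
# Hypothetical sketch: rendering per-column dropdown filters as
# parameterized SQL fragments. Each filter is (column, operator, values).

def to_sql(column, op, values):
    """Render one dropdown filter as (sql_fragment, params)."""
    if op == "IN":
        placeholders = ", ".join(["%s"] * len(values))
        return f"{column} IN ({placeholders})", list(values)
    if op == "BETWEEN":
        lo, hi = values
        return f"{column} BETWEEN %s AND %s", [lo, hi]
    if op == "EQ":
        return f"{column} = %s", [values[0]]
    raise ValueError(f"unsupported operator: {op}")

def combine(filters):
    """AND all active filters into one predicate with a flat param list."""
    parts, params = [], []
    for column, op, values in filters:
        sql, p = to_sql(column, op, values)
        parts.append(sql)
        params.extend(p)
    return " AND ".join(parts), params
```

Keeping the values as bind parameters rather than splicing them into the SQL string matters here, since the filter values are user-supplied.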

The catch is that they can do this filtering on columns that are, behind the scenes, retrieved via joins. And the data set being searched may be many millions of rows. And it needs to come back interactively, i.e. no more than a second or two. And the schema is customer-defined.

I implemented a syntax tree that is composed from the various columnar dropdown filters and is passed as a big AND to the API. The API essentially receives a full syntax tree for a predicate, but each field referenced in the predicate might be available only through a series of joins, or directly on the main table - working that out requires some analysis.
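The analysis step can be sketched like this: walk the top-level conjuncts of the ANDed tree and partition them by which table each referenced field lives on. The class names and the field-to-table map below are illustrative assumptions, not the real schema:

```python
# Minimal sketch of predicate analysis: split the top-level AND into
# conjuncts answerable from the core table alone vs. those needing joins.

from dataclasses import dataclass, field

@dataclass
class Cmp:              # leaf node: one columnar dropdown filter
    field: str
    op: str
    values: list

@dataclass
class And:              # the big AND composed from the active filters
    children: list = field(default_factory=list)

# Which table each filterable field resolves to (invented example schema).
FIELD_TABLE = {
    "created_at": "orders",          # on the core table
    "total": "orders",
    "customer_name": "customers",    # reachable only via a join
}

def split_core(pred, core_table="orders"):
    """Partition top-level conjuncts into (core-table-only, needs-joins)."""
    core, joined = [], []
    for child in pred.children:
        target = core if FIELD_TABLE[child.field] == core_table else joined
        target.append(child)
    return core, joined
```

Because the schema is customer-defined, the real field-to-table mapping would have to be built at runtime from the customer's schema metadata rather than hard-coded like this.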

The database backing this is MySQL. MySQL has a fairly simple query optimizer, and it reliably performs poorly when given complex predicates over queries with lots of joins. So we split up the query: we try to peel off the ANDed predicate parts that relate only to the core table and push them into a derived table (i.e. a nested query) along with the sort order, if possible. In the next level out, we join in just the tables required to evaluate the rest of the predicate. And finally, in the outermost query, we do all the joins required for the final page of results - the joins needed to retrieve the columns that weren't used by filters or sorts.
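The layering above can be sketched concretely. This example runs against SQLite purely for illustration (the real system targets MySQL, and the table and column names are invented): the core-table predicate and the sort are pushed into the innermost derived table, and the join for display-only columns happens only in the outermost query, after the page has been narrowed down:

```python
# Sketch of the derived-table rewrite: filter + sort + page innermost,
# display-column joins outermost. Schema and data are invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT,
                         total REAL, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES
        (1, '2024-01-05', 50.0, 1),
        (2, '2024-01-06', 500.0, 2),
        (3, '2024-01-07', 75.0, 1);
""")

rows = conn.execute("""
    SELECT page.id, page.created_at, page.total, c.name
    FROM (
        SELECT id, created_at, total, customer_id
        FROM orders
        WHERE total BETWEEN 10 AND 100      -- core-table predicate, pushed down
        ORDER BY created_at DESC
        LIMIT 25                            -- one page of results
    ) AS page
    JOIN customers AS c ON c.id = page.customer_id   -- display columns only
    ORDER BY page.created_at DESC
""").fetchall()

for row in rows:
    print(row)
```

If part of the predicate also referenced customers columns, that join would move one level in, into a middle query between the derived table and the outermost select, as described above.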

With this level of analysis of the predicate, we could do things like reject overly complex queries outright. As it is, our underlying data model somewhat limits the expected volume per time period, and we have defaults for the maximum time window that can be searched, so it's mostly under control.
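A rejection rule like that could be as simple as counting how many distinct non-core tables the predicate would force joins on and refusing the query above a limit. The threshold and the field-to-table map here are invented for illustration:

```python
# Hypothetical complexity gate: reject a query whose predicate would
# require joining too many tables. All names and limits are invented.

MAX_JOINED_TABLES = 2

FIELD_TABLE = {
    "created_at": "orders",        # core table
    "customer_name": "customers",
    "warehouse": "shipments",
    "sku": "order_lines",
}

def check_complexity(fields, core_table="orders"):
    """Return the set of joined tables needed, or raise if too many."""
    joined = {FIELD_TABLE[f] for f in fields} - {core_table}
    if len(joined) > MAX_JOINED_TABLES:
        raise ValueError(f"query too complex: {len(joined)} joined tables")
    return joined
```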

Occasionally new devs add extra joins with derived tables that are a bit too complex for interactive querying, and staying on top of that is a challenge. It calls for a lot more performance testing before releases, whereas until now we've mostly been able to get by with retrospective performance monitoring. But the team is bigger, not every query modification gets sufficiently expert eyes, and so we need to beef up our processes.



