Doing significant data manipulation outside of SQL is often asking for trouble.
SQL queries are compiled into very efficient operations which would take a lot longer to get right imperatively. Not only that, but database engines are improving all the time, so the same code you wrote which declares your desired transformations tends to get faster over time and there is nothing to update or refactor. The transformations you write are sort of timeless because they are strongly decoupled from the implementation and hardware.
Lots of data transformation steps in imperative languages require persisting an intermediate calculation (e.g., an np.ndarray). SQL databases only do this when the query planner deems it absolutely necessary, and when they do, it shows up in your query plan as a "Materialize" step.
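As a minimal sketch (the table and column names are hypothetical), here is the kind of transformation that would need several intermediate arrays in NumPy but is a single declarative statement in SQL; the engine decides whether anything intermediate ever gets materialized:

```sql
-- Hypothetical schema: loans(loan_id, origination_date, amount),
-- payments(loan_id, paid_on, amount).
-- Repayment ratio per origination month, in one statement; the subquery's
-- result is materialized only if the planner decides that it pays off.
SELECT date_trunc('month', l.origination_date) AS vintage,
       count(*)                                AS loans_originated,
       sum(p.total_paid) / sum(l.amount)       AS repayment_ratio
FROM loans l
JOIN (
    SELECT loan_id, sum(amount) AS total_paid
    FROM payments
    GROUP BY loan_id
) p USING (loan_id)
GROUP BY 1
ORDER BY 1;
```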
The EXPLAIN feature of SQL is the most useful performance debugging tool. It also quickly alerts me to a potential logic flaw when the proposed plan looks insane. I have personally replaced several analytical programs that a very large bank used to monitor loan originations and performance.
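The workflow is roughly this (the query and table names here are assumptions, not the bank's actual schema):

```sql
-- Show the chosen plan with real timings and buffer usage. A surprise
-- Materialize node, a huge row-count misestimate, or an unexpected
-- sequential scan usually means either a missing index or a flaw in my logic.
EXPLAIN (ANALYZE, BUFFERS)
SELECT l.loan_id, sum(p.amount) AS total_paid
FROM loans l
LEFT JOIN payments p USING (loan_id)
GROUP BY l.loan_id;
```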
I don't use any special SQL language features to this day. The most sophisticated it typically gets is lots of subqueries, joins, and window functions. The real skill is distilling the convoluted imperative mess into its essence. I got really good at this.
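A typical shape of the distilled query, again with made-up names, is a window function picking the latest record per loan:

```sql
-- Latest status row per loan. The window function replaces what is often
-- a hand-rolled loop over dictionaries in the imperative version.
SELECT loan_id, status, status_date
FROM (
    SELECT loan_id, status, status_date,
           row_number() OVER (PARTITION BY loan_id
                              ORDER BY status_date DESC) AS rn
    FROM loan_status_history
) ranked
WHERE rn = 1;
```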
Half the effort is usually spent socializing the changes I have to make to their logic to get it working right in SQL, since those changes alter the behavior of their existing application. Oftentimes I find the client's imperative code attempting a logical operation such as a join, but implementing it incorrectly.
Their existing imperative code frequently produced subtly wrong results, or it depended on the order of the rows returned from the database (which is undefined without an ORDER BY). Yikes.
What they actually wanted was provably implemented incorrectly or relied on undefined behavior, and both the mistakes and the proper resolution in SQL could be verified with pen and paper on a sample set of loans if necessary to drive the point home.
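A common instance of this, sketched with hypothetical names: the imperative code fetched loans and payments as two separate result sets and matched them up by position, which only works if the database happens to return both in the same order. The SQL resolution is an explicit join key, plus an explicit ORDER BY wherever order actually matters:

```sql
-- Instead of zipping two unordered result sets together by index,
-- state the relationship and the ordering explicitly.
SELECT l.loan_id, l.amount, p.paid_on, p.amount AS payment_amount
FROM loans l
JOIN payments p USING (loan_id)
ORDER BY l.loan_id, p.paid_on;
```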
> Not only that, but database engines are improving all the time, so the same code you wrote which declares your desired transformations tends to get faster over time and there is nothing to update or refactor. The transformations you write are sort of timeless because they are strongly decoupled from the implementation and hardware.
Only true to a certain extent. Counterexample: performance regressions due to changes in the query planner or its input statistics. Changes aren't always positive, and logically equivalent plans can have very different perf characteristics.
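One way to see this on Postgres (a sketch, not a tuning recommendation): refresh the statistics and compare plans, or temporarily disable a join strategy to watch a logically equivalent plan perform very differently:

```sql
-- Refresh planner statistics for a table, then look at the plan again.
ANALYZE payments;
EXPLAIN ANALYZE
SELECT loan_id, sum(amount) FROM payments GROUP BY loan_id;

-- Steer the planner away from one strategy for this session only,
-- to compare a logically equivalent plan.
SET enable_hashjoin = off;
EXPLAIN ANALYZE
SELECT l.loan_id, sum(p.amount)
FROM loans l JOIN payments p USING (loan_id)
GROUP BY l.loan_id;
RESET enable_hashjoin;
```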
Fully agree. I am mostly on the analytical side. When my client uses Snowflake it's usually smooth sailing because it's so automated and performant. When I have my own analytical Postgres instance on my local machine I tune planner cost and memory consumption parameters in postgresql.conf, but I only rarely run into major gotchas. If my client uses IBM... I go for a walk on the beach or head out to lunch when I launch my query.
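The knobs in question are the usual planner-cost and memory parameters; a sketch of the kind of settings I mean (the values are illustrative, not recommendations):

```sql
-- Per-session tweaks while iterating on a heavy analytical query.
SET work_mem = '512MB';        -- more room for sorts and hash tables
SET random_page_cost = 1.1;    -- closer to reality on SSD/NVMe storage

-- Or persist them: ALTER SYSTEM writes postgresql.auto.conf,
-- which overrides postgresql.conf after a reload.
ALTER SYSTEM SET work_mem = '512MB';
ALTER SYSTEM SET effective_cache_size = '12GB';
SELECT pg_reload_conf();
```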
Your point that equivalent plans can have very different perf characteristics is very true. I always try to review the query plan with EXPLAIN if my query takes more than a minute, and rewrite the logic if necessary.
Very cool. So you write it with solid, declarative SQL and you can trust that it will be rock solid and optimized. I need to learn SQL instead of just doing NoSQL all the time. Thanks for the explanations.
I load JSON into Postgres all the time these days to analyze data for a government client, and I use Postgres's JSONB operators to untangle and index it. I use JSONB because I don't care about preserving the original textual representation of each record (string values within fields are still completely intact).
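A sketch of what that looks like, with a hypothetical single-column table of records and made-up keys:

```sql
-- Pull fields out of a JSONB column and index the ones you filter on.
SELECT record->>'case_id'                  AS case_id,
       (record->>'filed_on')::date         AS filed_on,
       record->'parties'->0->>'name'       AS first_party
FROM raw_records
WHERE record @> '{"status": "open"}';

-- A GIN index makes containment queries (@>) cheap;
-- an expression index covers one specific hot key.
CREATE INDEX ON raw_records USING gin (record jsonb_path_ops);
CREATE INDEX ON raw_records ((record->>'case_id'));
```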
Although I lean heavily on psql with \COPY and stdin/stdout pipes to a compressor like zstd (psql can reliably pipe CSV to and from any program, even on Windows), I found loading JSON records this way extremely frustrating.
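For the CSV side, the pattern is roughly this (psql meta-commands; the file and table names are made up, and the compressor runs on the client machine):

```sql
-- Stream a query out through zstd, and load it back while decompressing.
\copy (SELECT * FROM loans) to program 'zstd -q > loans.csv.zst' with (format csv, header)
\copy loans_staging from program 'zstd -dc loans.csv.zst' with (format csv, header)
```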
Whatever you do, NEVER use pipes in PowerShell. They don't stream; they buffer everything in RAM until your computer crashes. Microsoft is insane.
Since you use NoSQL, you can write a very short Python program that uses psycopg2 directly to load a list of dicts as JSONB rows into a single Postgres table with one column (I call mine "record").
At that point you can basically use Postgres as a NoSQL database and, if you want, structure the records using a VIEW.
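On the Postgres side the whole setup is tiny (the table and view names here are purely illustrative); the Python loader just inserts each dict into the single JSONB column:

```sql
-- One table, one JSONB column; every incoming dict becomes a row.
CREATE TABLE raw_records (
    record jsonb NOT NULL
);

-- Structure emerges later, as a view over the JSON, only if you want it.
CREATE VIEW records_structured AS
SELECT record->>'id'                          AS id,
       (record->>'created_at')::timestamptz   AS created_at,
       record->>'status'                      AS status,
       record                                 AS raw
FROM raw_records;
```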
We're in the process of documenting their use of JSON for financial record keeping as a design defect.