
I'm worried about the inverse - what Python data analysts should know about SQL. Because I've met tons of analysts who wouldn't be able to even run a basic select.

I've seen tons of (often non-reproducible) code written in place of a simple SQL query. I really wish bootcamps and other learning platforms focused on SQL a bit more.

(I am a Python data analyst who properly learned SQL only after several years in the industry.)



I’ve seen devs just run select * from table, then filter and sort it in their own code. Then they complain “the database is slow” when it’s spending all its time shipping gigabytes of data they don’t need!
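A minimal sketch of the difference, using stdlib sqlite3 and an invented orders table: both paths return the same rows, but the first drags the whole table over before doing any work.

```python
import sqlite3

# Toy table standing in for a big production table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EU", 10.0), (2, "US", 99.0), (3, "EU", 55.0)])

# Anti-pattern: ship every row, then filter and sort client-side.
all_rows = conn.execute("SELECT * FROM orders").fetchall()
eu_slow = sorted((r for r in all_rows if r[1] == "EU"), key=lambda r: -r[2])

# Better: let the database filter and sort, and ship only what's needed.
eu_fast = conn.execute(
    "SELECT id, region, total FROM orders "
    "WHERE region = ? ORDER BY total DESC", ("EU",)).fetchall()

assert eu_slow == eu_fast  # same result, very different data volume
```

On a three-row table it makes no difference, of course; the point is that the second query's cost scales with the answer, not with the table.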


From what I've encountered, this is generally the case when someone is in "analysis/reports" mode. Rather than write a SQL query to get summary statistics on each column, count the nulls, and so on, they pull the data into a Python/R session and use general-purpose functions and utilities. The "programmers are expensive" argument probably applies here as well. I'm not trying to be defensive, just saying this might be one reason.
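For what it's worth, that profiling step fits comfortably in one query. A sketch with sqlite3 and a made-up readings table, collecting row count, null count, and min/avg/max in a single round trip:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("a", 1.0), ("a", None), ("b", 3.0), ("b", 5.0)])

# COUNT(*) counts rows; COUNT(value) skips NULLs, so the difference
# is the null count. Aggregates ignore NULLs as well.
profile = conn.execute("""
    SELECT COUNT(*)                AS n_rows,
           COUNT(*) - COUNT(value) AS n_nulls,
           MIN(value), AVG(value), MAX(value)
    FROM readings
""").fetchone()
```

One such query per column replaces a whole dataframe download for the common "what does this table look like?" questions.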


If you believe "programmers are expensive", then you should do as much as you can in a declarative data-manipulation language (usually SQL; you can also consider sequences of text-manipulation tools connected by pipes) and leave the last 5-15% of high-value work to a more powerful but more verbose imperative language (usually Python, but any will do).

Asking for what you want is considerably faster than saying how you want it done.


From what I encountered, this is generally the case when someone is in the "analysis/reports" mode

I understand this use case, but this is in actual application code!


Is anyone working on a translator for pandas dataframe syntax to SQL?


In the R world, dbplyr[1] does this amazingly well.

[1] https://dbplyr.tidyverse.org/


tidyverse is just a mind-bogglingly amazing ecosystem of packages

Forget Da Vinci, the first man to be cloned should be Hadley Wickham



Dang, Ibis doesn't support Redshift or SQL Server. I'm also having trouble understanding what it really is; it seems to be an entire framework for big data, not just a translator. What I'd really like is exactly that: something that turns pandas dataframe operations into ANSI SQL. So pandas2sql('tablename["col"]') -> "select col from tablename". Something really simple to use.
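I don't know of an off-the-shelf tool with exactly that interface, but a toy version of the hypothetical pandas2sql (handling only the single-column selection spelled out above; everything here is invented for illustration) is nearly a one-liner:

```python
import re

def pandas2sql(expr: str) -> str:
    """Toy sketch: translate table["col"] into a SELECT statement.

    Only handles the single-column case; a real translator would need
    to parse a pandas expression tree, not a string.
    """
    m = re.fullmatch(r'(\w+)\["(\w+)"\]', expr.strip())
    if not m:
        raise ValueError(f"unsupported expression: {expr}")
    table, col = m.groups()
    return f"select {col} from {table}"
```

The hard part of the real problem is that pandas operations are eager and positional while SQL is declarative, which is why projects like Ibis build a lazy expression tree instead of translating strings.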


Pandas has read_sql and to_sql methods that are compatible with SQLAlchemy if you insist on using an ORM; that gets you most of the way there...
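A sketch of that round trip (table and column names invented), with sqlite3 standing in for a real database; the point is that read_sql lets the WHERE clause run on the database side, so only matching rows ever reach pandas:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Write a dataframe out as a table...
pd.DataFrame({"id": [1, 2], "val": [10.0, 20.0]}).to_sql(
    "measurements", conn, index=False)

# ...and read back only the rows the query asks for.
df = pd.read_sql("SELECT val FROM measurements WHERE id = ?",
                 conn, params=(1,))
```

read_sql also accepts a SQLAlchemy engine in place of the raw DBAPI connection, which is what you'd use against anything other than SQLite.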


SQLAlchemy is more than just an ORM. It also has a SQL expression language for writing queries in Python without using any ORM features.
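A sketch of that expression language with an invented users table (this uses the positional select() form, so it assumes SQLAlchemy 1.4 or newer): no ORM classes, just table metadata and composable query objects.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, select)

engine = create_engine("sqlite://")  # in-memory SQLite for illustration
metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("active", Integer),
)
metadata.create_all(engine)

with engine.connect() as conn:
    conn.execute(users.insert(),
                 [{"name": "alice", "active": 1},
                  {"name": "bob", "active": 0}])
    # Composable query object, compiled to the target dialect's SQL.
    stmt = select(users.c.name).where(users.c.active == 1)
    names = [row[0] for row in conn.execute(stmt)]
```

Because the statement is compiled per dialect, the same Python code emits the right SQL for SQLite, Postgres, and so on.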




I'm surprised people don't use ORM libs for this instead..


ORMs are just as capable as any developer of generating bad SQL queries that crush databases and plug network connections.


Yes, but in the example given, all the ORM libs I know would fetch only a column or two (often for a single row), while "select *" fetches everything.


Or worse, they use cursors. Cursors should never be used in SQL. Ever. That's my philosophy.


What is a better approach? I have Python that directly connects to an Oracle database, and the Oracle blog tutorial[0] for using their Python package always uses a cursor.

[0] https://blogs.oracle.com/oraclemagazine/perform-basic-crud-o...


I imagine neither you nor the GP is talking about in-database cursors that you open in SQL, use in the same SQL script, and close at the end before moving on. There isn't really a problem inherent to those, and they stay non-problematic if you write your database scripts in Python, C, or whatever.

For a server reading from and writing to a database, you paginate your queries. On Postgres that means using "limit" and doing small bulk inserts; under Oracle your options for inserting are more limited, so there is still value in cursors.
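A sketch of that pagination pattern with sqlite3 and an invented events table, using keyset pagination (resume after the last key seen) rather than OFFSET, since OFFSET re-scans skipped rows on every page:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [(f"e{i}",) for i in range(10)])

def paginate(conn, page_size=3):
    """Yield the table one page at a time, resuming after the last id."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM events WHERE id > ? "
            "ORDER BY id LIMIT ?", (last_id, page_size)).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]

pages = list(paginate(conn))  # 10 rows at page size 3 -> 4 pages
```

Each page is an independent, cheap query, so error handling and retries are straightforward: on failure you just re-run from the last id you committed.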

There is the odd occasion where you'll iterate through the entire result set, do something fast for each and every row, and only get something useful at the end. Those are classic problems where cursors are more efficient than the alternatives. But even then, once you add error handling and recovery, pagination may yield better results.


They're talking about client-side cursors; these are part of the Python DB API. What you shouldn't use is a server-side cursor, which you create with DECLARE in PostgreSQL. These keep their state on the server and are intended as optimizations for special cases like streaming data processing or realtime updates. Basically, for the deep internals of realtime systems, not regular queries.


What you shouldn't use is a server-side cursor, which you create with DECLARE in PostgreSQL.

That entirely depends on the DB. Oracle for example always executes with a server-side cursor; all declaring it does is give you a handle to what it was doing anyway.


The problem isn't the cursor itself, but that it is usually a symptom of procedural thinking versus set-based thinking. In an RDBMS it's typically far faster to puzzle out the joins, CTEs, and set-based expressions and functions that winnow down a dataset than to apply cursor-based procedural logic row by row.
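A small illustration of the two mindsets against an invented sales table: the loop version walks every row in client code, while one GROUP BY hands the whole problem to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 10.0), ("US", 20.0), ("EU", 5.0)])

# Row-by-row (procedural): accumulate totals in client code.
totals_loop = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals_loop[region] = totals_loop.get(region, 0.0) + amount

# Set-based: the same result expressed as one aggregate query.
totals_set = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

assert totals_loop == totals_set
```

The set-based form also gives the planner room to use indexes and parallelism, which a row-at-a-time loop never can.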


Thank you for clarifying that; I was wondering where the concern was coming from.

Without disagreeing with any of the above, one important consideration is what you're going to do with the query. If all you want to know is a column's mean or some other simplified statistical value, there's really no sense in pulling all the data into Python just to calculate it. Do it inside the DB itself with SQL.

On the other hand, if you need that data to do other work (i.e. populate the table in a webpage, or generate a new descriptive data set or whatever), then the trade-off for pulling it into Python/pandas and running a mean in addition to the other work becomes much smaller.

My approach is usually to do as much data filtering as possible inside SQL, but complex parsing and string manipulation (especially!) I'll do with Python. I can do some simple string work in SQL, but I can almost always do it faster and cleaner in Python.
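Roughly that division of labor, with invented names: SQL winnows the rows, Python does the fiddly string cleanup on whatever is left.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [(" Ada Lovelace ", 36), ("alan turing", 41),
                  ("Grace Hopper", 85)])

# Filter inside SQL so only the relevant rows cross the wire...
rows = conn.execute(
    "SELECT name FROM people WHERE age > 40 ORDER BY age").fetchall()

# ...then do the messy string normalization in Python, where it's cleaner.
names = [n.strip().title() for (n,) in rows]
```

The WHERE clause does the work that scales with table size; the string munging only runs on the handful of survivors.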


Why? (I am not taking a contrary position by asking).

I'm not really an RDBMS guy (but I can write a select query from scratch!), but IIRC the Netscape Server API (where JavaScript got its start, not in the browser!) made heavy use of, and expected, cursors for scrolling through tabular data and the like. I don't recall if they lived in the DB or the HTTPD server.

But I do know that if you didn't want a 20,000-row response to your query HTML-ized, transmitted, and then displayed (sometimes over a 56kbps modem), you used cursors in the NSAPI.

I have no idea about whether that was a good design decision at the time, and even less idea now, but I kind of incorporated the practice, and didn't know it was "bad".

It is relevant for large web-API endpoint responses (pagination) even now, no?

So! Why is that bad? :-)

EDIT: Please don't be snarky; I spend more time figuring out clock skew on high-frequency mixed-signal boards than I do talking to a database, and I'd like to get better at the latter.


How's PCIe Gen4 treating you? :)


Ha! I'm in more "not open" systems, but if there is zero noise, it's great! :-)


It all depends on the database and the API in use. When handling sqlite with Python I often use Roger Binns' APSW. (https://rogerbinns.github.io/apsw/cursor.html#cursors)

With an in-memory database, or running on an NVMe drive, you can get some ridiculous performance out of SQLite using APSW cursors.


To my knowledge, you can't get data from an RDBMS without a cursor - it's kinda required to do even a simple SELECT. I'm guessing what you're referring to is keeping a cursor open from a Python process when it should have been closed after the results were brought into memory.


Generally you'll find that at some point a mix of set-based and loop-based logic wins the day in most SQL dialects, but shy away from cursors until you absolutely must use them.


Server-side cursors are great if you need to do streaming, which I have in the past. They keep both client and server memory use in check.
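The shape of that pattern, with sqlite3 standing in for a server database (against PostgreSQL, psycopg2's named cursors give you a true server-side cursor that streams the same way): fetch in chunks so client memory is bounded by the chunk size, not the result size.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (line TEXT)")
conn.executemany("INSERT INTO logs VALUES (?)",
                 [(f"line {i}",) for i in range(1000)])

# Stream the result in chunks instead of fetchall():
# only one chunk of rows is ever resident at a time.
cur = conn.execute("SELECT line FROM logs")
processed = 0
while True:
    chunk = cur.fetchmany(100)
    if not chunk:
        break
    processed += len(chunk)  # do per-row work here
```

With a named cursor the server holds the query state and hands over batches on demand, so neither side materializes the full result set.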


I wouldn’t be so orthodox; there may be some special cases where a database is used in non-standard ways to do strange stuff. However, if cursors are used to retrieve or update data, to do actual database work, they always look like a capitulation, a retreat into procedural territory. If you use cursors you’re not giving the database a chance to shine, and that makes it very sad.


Could you elaborate further?



For MSSQL, yes, but Oracle is fine with cursors. Which RDBMS?


What’s wrong with cursors? Aren’t they just iterators?




