Or, you know, use a time-aware database like XTDB or Datomic.


Btw, am I alone in thinking that DataFrame abstractions in OOP languages (like Pandas in Python) are oftentimes simply inferior to relational algebra? I'm not sure that many Data Scientists are aware of the expressive power of SQL.


There are loads of things that are either impossible or very cumbersome to write in SQL, but that pandas and many other dataframe systems allow. Dropping null values based on some threshold, one-hot encoding, covariance, and certain data cleaning operations are all possible in SQL, but very cumbersome to write. There are also things related to metadata manipulation that are outright impossible in a relational database.
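
A rough sketch of what I mean, in pandas (made-up column names, nothing from a real schema):

    import pandas as pd

    df = pd.DataFrame({
        "city":  ["NYC", "SF", None, "NYC"],
        "tier":  ["a", "b", "a", None],
        "sales": [10.0, None, 3.0, 7.0],
    })

    # Keep only rows with at least 2 non-null values: one keyword here,
    # a per-column CASE-counting expression in SQL.
    cleaned = df.dropna(thresh=2)

    # One-hot encoding: a single call, vs. one CASE WHEN per category value.
    encoded = pd.get_dummies(cleaned, columns=["tier"])

    # Covariance across all numeric columns: the result's rows and columns
    # are the original column names, i.e. metadata turned into data.
    cov = cleaned.cov(numeric_only=True)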

SQL is super expressive, but I think pandas gets a bad rap. At its core, the data model and language can be more expressive than relational databases (see [1]).

I co-authored a paper that explained these differences with a theoretical foundation[1].

[1] https://arxiv.org/abs/2001.00888


Thanks for sharing this. I believe we essentially agree: chaining method calls is inexpressive compared to composing expressions in an algebraic language.


I'm not defending Pandas but just want to point out that the inability to conveniently compose expressions is one of the biggest problems with SQL, since it was designed to be written as a sort of pseudo-English natural language, in an era when people imagined that it would be used by non-programmers. To be clear, that's a problem with SQL, not with the idea of a language based on relational algebra. There are various attempts to create SQL-alternatives which behave like real programming languages in terms of e.g. composability. This blog post makes the point better than I can:

https://opensource.googleblog.com/2021/04/logica-organizing-...


I absolutely agree - one of the biggest shortcomings of SQL is that its primary programming interface is based on text and intended for humans, instead of being based on data structures and intended for programs.
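
SQLAlchemy Core is one existing stab at exactly that, for what it's worth: the query is an ordinary data structure, and the SQL text is derived from it. A minimal sketch (toy table, not a real schema):

    from sqlalchemy import Column, Integer, MetaData, String, Table, select

    metadata = MetaData()
    users = Table("users", metadata,
                  Column("id", Integer, primary_key=True),
                  Column("country", String))

    # The query is a value: programs can build, inspect and compose it
    # without any string splicing.
    base = select(users).where(users.c.country == "FR")
    refined = base.limit(10)  # composition by method call, not text editing
    print(refined)            # the SQL string is generated at the very end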


SQL does not exactly implement relational algebra in its pure form.

SQL implements a kind of set theory with relational elements and a bunch of practical features like pivots, window functions etc.

Pandas does the same. Most data frame libraries like dplyr etc. implement a common set of useful constructs. There’s not much difference in expressiveness. LINQ is another language for manipulating sets that was designed with the help of category theory, and it arrives at the same constructs.

However, SQL is declarative, which provides a path for query optimizers to parse it and create optimized plans. With chained methods, unless one implements lazy evaluation, one misses out on look-aheads and opportunities to do rewrites.
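
For what it's worth, some newer dataframe libraries do implement exactly that lazy evaluation. A sketch with Polars (hypothetical file and column names):

    import polars as pl

    lazy = (
        pl.scan_csv("events.csv")          # nothing is read yet
          .filter(pl.col("amount") > 0)
          .group_by("user_id")
          .agg(pl.col("amount").sum())
    )
    # The engine sees the whole plan before running it, so it can push the
    # filter down into the scan -- the same rewrite a SQL planner would do.
    result = lazy.collect()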


> There’s not much difference in expressiveness

> However SQL is declarative

Pick one :) The way I see it, if declarativeness is not a factor in assessing expressiveness, then expressiveness reduces to the uninteresting notion of Turing-equivalence.


Expressiveness and declarativeness are different things, no?

Are you talking about aesthetics? I’ve used SQL for 20 years and it’s elegant in parts, but it also has warts. As I mention elsewhere, SQL gets repetitive and requires multi-layer CTEs to express certain simple aggregations.
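
A toy case of what I mean (made-up columns): computing each user's share of their own total.

    import pandas as pd

    df = pd.DataFrame({"user": ["a", "a", "b"], "amount": [1.0, 3.0, 2.0]})

    # One chained expression in pandas...
    share = df["amount"] / df.groupby("user")["amount"].transform("sum")

    # ...vs. SQL, where you typically need a CTE (or window function) for the
    # per-user totals, then another layer selecting amount / total from it.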


Agree. I've built data pipelines for several projects and have found that the cleanest, and often fastest, solution is to use SQL to structure the data as needed. This is anecdotal and I'm not an expert with SQL, but I haven't come across a situation where R or Pandas dataframes worked better than a well-written query for data manipulation. This also simplifies collaboration across teams: within my company not everyone uses the same toolset for analysis, but we all have access to the same database. Other tools are better suited to analysis or to enriching the data with input from other sources, but within our own data SQL wins.


Often -- yes. Always -- no.

For example let's try changing/fixing sampling rate of a dataset (.resample() in Pandas).

Or something like .cumsum() -- doable with SQL window functions, but man, they are cumbersome.

Or quickly store the result in .parquet.
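
Concretely, all three in a few lines (toy data; to_parquet assumes pyarrow is installed):

    import pandas as pd

    ts = pd.DataFrame(
        {"value": [1.0, 2.0, 4.0]},
        index=pd.to_datetime(["2021-01-01 00:00",
                              "2021-01-01 00:03",
                              "2021-01-01 00:10"]),
    )

    fixed = ts.resample("5min").mean()          # re-grid irregular samples
    fixed["running"] = fixed["value"].cumsum()  # vs. SUM(...) OVER (ORDER BY ...)
    fixed.to_parquet("out.parquet")             # one line to a columnar file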

But all the above doesn't matter, because I feel like 99% of Pandas work involves quickly drawing charts on the data to look at it or show it to teammates.


> factoring prime numbers

factoring into prime numbers ;-) (factoring a prime number is trivial, that's why it's called prime)


It's only trivial if you already know it's prime. Determining that is non-trivial enough that a tractable deterministic algorithm (AKS) wasn't devised until 2002, and its time complexity is thought to be on the order of the sixth power of log n, i.e. of the number of digits.
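
(In practice libraries use fast probabilistic tests like BPSW rather than AKS, but the asymmetry between testing and factoring is easy to see, e.g. with sympy:)

    from sympy import isprime, nextprime

    p = nextprime(10**40)   # a 41-digit prime, found almost instantly
    q = nextprime(p)
    print(isprime(p * q))   # False -- and detecting that is fast too
    # Recovering p and q from p*q is the hard direction: sympy.factorint(p*q)
    # would grind for a very long time on numbers of this size.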


> Are quantum computers overhyped?

Well, let's see:

> When given the same problem, a quantum computer should be able to trounce any supercomputer in any problem in terms of speed and efficiency

LOL, no, not any problem, far from it. Some problems, rather specific ones, such as prime factoring.

> Our current system, for example, taps into electrons and cleverly-designed chips to perform their functions. Quantum computers are similar, but they rely on alternative particle physics.

Um, no, they both rely on the same physics, that is a combination of Quantum Mechanics and electromagnetism. Note to the author: an electron is a quantum system, and classical electronics definitely rely on that.

So yes, quantum computers are overhyped, through no fault of their own, and this article contributes to the trend.


> Some problems, rather specific ones, such as prime factoring.

You don't need a quantum computer for that! I can factor arbitrarily large primes in my head. For any given prime p, its factors are 1 and p. Done!

:-)


I made the same remark in reply to another comment which used the phrase "factoring primes" :) Wikipedia does use the term "prime factorization": that seems legit to me, as prime is used as an adjective. https://en.wikipedia.org/wiki/Integer_factorization


Another legitimate meaning might be "factoring probable primes" (or "candidate primes" as they are sometimes called in key generation/cryptanalysis), or possibly "factoring semiprimes".

Both of those phrases could be referred to as "prime factorisation" in a not-entirely-accurate-but-unambiguous-in-context shorthand.

https://en.wikipedia.org/wiki/Probable_prime

https://en.wikipedia.org/wiki/Semiprime


> LOL, no, not any problem, far from it. Some problems, rather specific ones, such as prime factoring.

Yeah, as someone who works in quantum computing, this is the hardest thing for me to explain to non-technical people. For technical people, I liken it to an FP unit or some other specialized coprocessor that's often embedded in CPUs/GPUs.

> Quantum computers are similar, but they rely on alternative particle physics.

I think it's fair to say this in reference to using different physical properties of electrons than what normal computers use. The physics rules are the same, but how you manipulate them is different, presumably (I don't know much about how photonic QCs work).


I never thought of it that way for some reason. Always imagined mature quantum computers as being their own system. But it's possible a lot of them will be supplementary components to a classical computer. We have storage-over-PCIe, graphics-over-PCIe, and soon quantum-over-PCIe?


It'll be a long time before they need remotely comparable bandwidth, and more than likely the latency of the higher-level protocols won't even come close. PCIe would work fine, but so would old-school serial.


That seems unlikely to happen in the near to medium term. For that to happen, everything would have to be rewritten using a quantum algorithm and language, and run on quantum hardware. Imagine writing a web browser in a quantum language, within a quantum computing software ecosystem. It's hard to see how that would have any benefit.

If you are talking 100 years out, though, who knows?


Yes it's overhyped, but to be fair, the whole point of classical electronics is to hide the quantum nature as much as possible. You want your transistor to act as a deterministic switch, not be in a superposition of states.


Well, to be fair in turn, we also want quantum electronics to be deterministic in their behaviour. The difference lies not so much in randomness as in leveraging entanglement.


> Some problems, rather specific ones, such as prime factoring.

This is absolutely how we understand the technology now, but I think it's worth noting that computing luminaries also thought "640Kb of memory was more than enough for anyone" and that "eight mainframe computers will serve the computing needs of everyone across the planet" at one point in time, too. Quantum computers are definitely overhyped and that may be all they're good for, but it's also possible we'll figure out how to do some crazy shit with them in the future, too.


Singularity Hub is just a clickbait pop science site.


> the annual worldwide energy usage of blockchain technology is roughly equal to the annual US energy waste from machines plugged in while in standby mode. It is also significantly lower than the annual worldwide usage of Christmas lights, and wash dryers.

... and it benefits far, far fewer people.

That's analogous to the pro-air-travel disinformation argument: "air travel is only 2% of CO2 emissions". It is only so because air travel is adopted by a small minority of people; that doesn't stop it from being insanely carbon-intensive.


April fool's joke: we'll pretend that bugs in software are distributed as a homogeneous Poisson process, AND that Poisson distributions are bounded, while we're at it.

Poisson d'avril! ("April fish": the French expression for an April fool's joke.)


Reminder that the Computer Languages Benchmark Game itself recommends against using it to draw general conclusions about performance of languages in real-world apps: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

> We are profoundly uninterested in claims that these measurements, of a few tiny programs, somehow define the relative performance of programming languages aka Which programming language is fastest?

Now, I challenge you to find a major piece of bloated software where the main source of overhead is Python interpretation. IME it's always something else, like the surrounding UI framework.

The Office suite is written in C++ and is badly bloated, obviously not because of language execution overhead but because of technical debt; if that's any indication, it argues against using low-level languages rather than for them.


> I challenge you to find a major bloated software where the main source of overhead is Python interpretation

In every piece of non-trivial software I’ve written in python, the main source of overhead has been Python interpretation.

I don’t think it’d be hard at all to meet your challenge.


If it wouldn't be hard at all, we're left to wonder why you don't seem to have tried.


Performing a legitimate performance benchmark of even one piece of enterprise Python software — much less across a representative survey — is well beyond the reasonable scope of a comment board reply.
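
The interpreter overhead itself is easy to show in miniature, though (a toy measurement, not an enterprise benchmark):

    import timeit

    import numpy as np

    xs = list(range(1_000_000))
    arr = np.arange(1_000_000)

    # Same arithmetic; the first spends nearly all its time in the bytecode
    # interpreter, the second in compiled loops.
    print(timeit.timeit(lambda: sum(x * x for x in xs), number=10))
    print(timeit.timeit(lambda: (arr * arr).sum(), number=10))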


So, too hard.


> Sir Tony Hoare famously said: “Premature optimization is the root of all evil.”

Er, wasn't it Donald Knuth? https://wiki.c2.com/?PrematureOptimization


It was, but the author can be forgiven for mixing them up. Knuth did say it originally, but Hoare repeated it (in writing), properly attributing it to Knuth. Knuth then read Hoare's quote, missed the attribution, forgot that he was the one who said it, and repeated it again in writing, mis-attributing it to Hoare.



Not totally sure.

https://ubiquity.acm.org/article.cfm?id=1513451

> Every programmer with a few years' experience or education has heard the phrase "premature optimization is the root of all evil." This famous quote by Sir Tony Hoare (popularized by Donald Knuth) has become a best practice among software engineers.


I submit that Event Sourcing (or at least something very close to it) can be easy, once you've removed some technology locks. I've seen it happen with Datomic (https://vvvvalvalval.github.io/posts/2018-11-12-datomic-even...), but it can probably happen similarly with other tech, such as XTDB (https://xtdb.com/).
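
To be clear about terms, the shape I have in mind is just this: an append-only log of events as the source of truth, plus a pure replay function deriving the current state. (A toy Python sketch with made-up event types, nothing like Datomic's actual implementation:)

    import json

    def apply_event(state, event):
        # Pure function: current state + one event -> next state.
        balance = state.get("balance", 0)
        if event["type"] == "deposited":
            return {**state, "balance": balance + event["amount"]}
        if event["type"] == "withdrawn":
            return {**state, "balance": balance - event["amount"]}
        return state

    def replay(log_path):
        # State is never stored, only derived by folding over the log.
        state = {}
        with open(log_path) as f:
            for line in f:
                state = apply_event(state, json.loads(line))
        return state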


I wrote the blog post you cited (thanks!) but I disagree with both statements: that is not what is meant in the article.

1. I don't think Event Sourcing sucks - I think we are lacking accessible technology for supporting it.

2. For most difficulties encountered in Event Sourcing, I would rather blame distribution taken to the extreme.

