Similar research is being done for humpback whales by HappyWhale: https://happywhale.com/
They're making a great effort, with already over 600K user/tourist-submitted photos of whale encounters, to track individual whales across the globe.
Great approach and good write-up! I’ve implemented a similar technique before on PostgreSQL, but with the materialisation in the application backend. Still works like a charm.
>So instead, we created a special kind of trigger that tracks the minimum and maximum times modified across all the rows in a statement and writes out the range of times that were modified to a log table. We call that an invalidation log.
Does this invalidation log also take into account cases where the view has an aggregate based on data from a bucket other than its own? For example, a lag() or lead() might be used to calculate a delta compared to the previous bucket. Then, if a data point is added to the realtime table inside bucket 1, and bucket 1 is invalidated and re-materialised, doesn't bucket 2 also need to be re-materialised for integrity reasons?
For removing repeated (intermediate) calculations, I generally find it easier in some cases to use a lateral join (in Postgres), like:
select
    title,
    country,
    avg(salary) as average_salary,
    sum(salary) as sum_salary,
    avg(gross_salary) as average_gross_salary,
    sum(gross_salary) as sum_gross_salary,
    avg(gross_cost) as average_gross_cost,
    sum(gross_cost) as sum_gross_cost,
    count(*) as emp_count
from
    employees,
    lateral (
        select
            (salary + payroll_tax) as gross_salary,
            (salary + payroll_tax + healthcare_cost) as gross_cost
    ) employee_ext
where
    country = 'usa'
    and gross_cost > 0
group by title, country
having count(*) > 200
order by sum_gross_cost
limit 3;
So now we have easily come up with three different ways of rewriting the query to avoid that duplication (which obviously was not a problem at all to begin with): subquery, CTE and lateral join. And there are several more well-known ways (views, custom functions, computed columns etc), so is the whole premise for even inventing a "better" language than SQL false? Or what am I missing?
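For comparison, the CTE rewrite mentioned above could look something like this (just a sketch, assuming the same employees table and columns as the lateral-join example upthread):

```sql
with employee_ext as (
    select
        title,
        country,
        salary,
        (salary + payroll_tax) as gross_salary,
        (salary + payroll_tax + healthcare_cost) as gross_cost
    from employees
    where country = 'usa'
)
select
    title,
    country,
    avg(salary) as average_salary,
    sum(salary) as sum_salary,
    avg(gross_salary) as average_gross_salary,
    sum(gross_salary) as sum_gross_salary,
    avg(gross_cost) as average_gross_cost,
    sum(gross_cost) as sum_gross_cost,
    count(*) as emp_count
from employee_ext
where gross_cost > 0
group by title, country
having count(*) > 200
order by sum_gross_cost
limit 3;
```

One caveat: before PostgreSQL 12, CTEs were optimisation fences, so the lateral-join form could plan differently; on recent versions it's mostly a readability choice.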
It's also weird how people always argue for immutability and eliminating local state when using procedural languages, but as soon as they switch to SQL, which actually works like this, they immediately want to introduce mutability and local state.
> so the whole premise now for even inventing a “better” language than SQL is then false?
I don't think anyone is using the above examples to try to invalidate PRSQL, just suggesting that the baseline for comparisons should account for all constructs available in the SQL standards and common implementations thereof.
> Or what am I missing.
The statement “I can do X better than <SQL example> with <something else>” does not properly show the benefit of <something else> if “I can do X better than <SQL example> with <another SQL example>” is also true (assuming <another SQL example> is actually agreed to be better, not for instance convoluted/confusing/long-winded/other so just replacing some problems with others).
If there are multiple ways to do the same thing, that's usually a BAD thing in terms of language design. Especially if some approaches are just newbie traps that experts learn to avoid, or if deciding on the best method is a really subtle, context-dependent decision. The ideal design is one where the language encourages the one obviously "good" way to do it.
I'm not aware of any general-purpose programming language that doesn't have multiple ways to achieve a specific goal. Can you give an example of a language with good design?
Nope, that's - as far as I can see - just a sequence of increasing integers. Both the K and J examples give an array of random integers in the 0-1000000 range.
For reference, in J such a sequence of integers can be generated with:
1+i.1000000
1 2 3 4 5 6 7 8 9 10 ....
or:
1+i.1000000 1
1
2
3
4
5
6
7
8
9
10
....
To explain the previous examples (and let's use a smaller integer for less typing...), the `$` verb is called "shape"/"reshape"; it takes a (list of) values on the right side and a list of dimensions on the left:
5 2 $ 10
10 10
10 10
10 10
10 10
10 10
if there are not enough values on the right, they are cycled:
5 2 $ 10 11 12
10 11
12 10
11 12
10 11
12 10
which degenerates to repetition if there's only one value on the right. The `~` adverb (called "reflex") modifies the verb to its left in the following way:
V~ y NB. same as y V y
so `$~10` is the same as `10$10`, which is a list of ten tens. That list is passed to `?`, a verb called "roll", which gives a random integer in the 0-(y-1) range when written as `? y`. `y` here can be a scalar or a list, in which case the roll is performed for each element of the list:
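For example (sticking with small numbers; the output varies per run, of course):

```j
   ? 10 10 10
```

which gives three random integers, each below 10. So `?$~10` produces ten random integers below 10, and `?$~1000000` a million random integers in the 0-1000000 range, as in the earlier examples.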
The dyadic case, ie. `x ? y` is called "deal", which selects `x` elements from `i. y` list at random, without repetitions. `?~ y`, then, effectively shuffles the `i. y` list:
?~10
6 9 7 3 5 1 8 0 4 2
"deal" can be used to shuffle any list, not only the `i. y` sequence, by using the shuffled list as indexes of another list (using `{` verb, called "from"):
...I know, I know, it is strange. But it's so interestingly mind-bending that I'd be really happy if I had a valid excuse to pour hundreds of hours into learning J properly. Sadly, I don't have anything like that, so I only spread the strangeness from time to time in comments, like I do right now :)
This can generally fall under the category of "Generate Test Data in SQL", with both generic techniques like recursive CTEs, as in the OP, and SQL-dialect- or tool-specific options. Search is your friend. The OP also provides a table-lookup technique for readable names but doesn't address other data types, such as timestamps, or other functions, such as MD5(). Data distribution techniques other than random may also be needed. This post scratches the surface of testing with generated data.
You can use a with-expression. It's true that you can't use `limit` to limit it, but you can use a `where` condition. This is equivalent to cribwi's example:
WITH RECURSIVE
t AS (
SELECT 0 id
UNION ALL
SELECT id + 1
FROM t
WHERE id < 1000000
)
SELECT id, random() FROM t;
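For completeness: on PostgreSQL specifically, the usual shortcut is generate_series rather than a recursive CTE (a sketch; random() in Postgres returns a double in [0, 1)):

```sql
-- One row per integer 0..1000000, each with a random value
SELECT id, random()
FROM generate_series(0, 1000000) AS id;
```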
You can assign a custom shortcut to it via Settings > Keymap and then the action "Main Menu > File > Local History".
I myself have assigned the shortcut control+shift+=
I've been using Phabricator for almost a year now in our company. At first, I also had my questions about the workflow with arcanist (the local PHP CLI software for interacting with Phabricator). However, right now I really love it and miss it in every repository that doesn't use Phabricator. Commands like 'arc feature', 'arc tasks' and 'arc inlines' provide a solid abstraction over the repository and bring the workflow to the CLI.
About the unit tests: indeed, out of the box Phabricator only supports running them locally (you can configure the unit engine to run every time a piece of code is sent into review). Luckily, it's not that hard to hook into the event engine (https://secure.phabricator.com/book/phabricator/article/even...) to, for example, trigger your CI server to run the test suite for a specific revision.
Also, Phabricator is still being actively developed. There are a bunch of alpha and beta applications which will only make it more powerful (for example DryDock and Harbormaster, which will ease the work of automatic builds: http://www.guywarner.com/2014/05/integrating-jenkins-and-pha...)
In my opinion, Phabricator is one of the most powerful tools around for doing code reviews and effective collaboration management. Even with the large amount of beta applications (which are promising in many cases) the decision for us was easy to make.
Since it broke some time ago: no. But when it was still working: yes, I found it rather useful, especially for very small commits/bugfixes and such.
I hope they'll fix it soon.