ajfriend · 2025-02-13T16:00:32 1739462432

I have a project that's still very much at the experimental stage, where I try to get something similar to this pipe syntax by allowing users to chain "SQL snippets" together. That is, you can use standalone statements like `where col1 > 10` because the `select * from ...` is implied. https://ajfriend.github.io/duckboat/

    import duckboat as uck

    csv = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'

    uck.Table(csv).do(
        "where sex = 'female' ",
        'where year > 2008',
        'select *, cast(body_mass_g as double) as grams',
        'select species, island, avg(grams) as avg_grams group by 1,2',
        'select * replace (round(avg_grams, 1) as avg_grams)',
        'order by avg_grams',
    )

I still can't tell if it's too goofy, or if I really like it. :)

I write a lot of SQL anyway, so this approach is nice in that I almost never need to look up function syntax like I would with Pandas: it is just DuckDB SQL under the hood, minus the need to write `select * from ...` repeatedly. And when you're ready to exit the data exploration phase, it's easy to gradually translate things back to "real SQL".

The whole project is pretty small, essentially just a light wrapper around DuckDB to do this expression chaining and lazy evaluation.
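For a rough idea of how this kind of snippet chaining can work, here is a minimal sketch that only builds a query string. It is not duckboat's actual implementation: `chain` is a made-up helper, and it assumes DuckDB's FROM-first syntax, where `from (...) where ...` implies `select *`.

```python
def chain(source, *snippets):
    """Hypothetical helper: nest SQL snippets into one query string."""
    query = f"select * from {source}"
    for snippet in snippets:
        # Assumes DuckDB's FROM-first syntax, so each snippet can follow
        # `from (<previous query>)` whether or not it starts with `select`.
        query = f"from ({query}) {snippet.strip()}"
    return query


query = chain(
    "'penguins.csv'",
    "where sex = 'female' ",
    "select species, count(*) as n group by 1",
    "order by n",
)
# Nothing has executed yet; the string could later be handed to DuckDB,
# for example with duckdb.sql(query).
print(query)
```

Since each step only wraps the previous string, evaluation stays lazy until the final query is actually run.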

ajfriend · 2025-02-07T21:02:16 1738962136

They also only have one type of neighbor. Square grids have 2 neighbor types. Triangular grids have 3.
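A quick way to see the neighbor-type difference is to count the distinct center-to-center distances from a cell to its neighbors. This is just an illustrative sketch on idealized flat grids with unit spacing (the helper name is made up, and nothing here uses H3 itself); the triangular case is left out to keep it short.

```python
import math


def neighbor_distances(offsets, to_xy):
    """Distinct center-to-center distances from a cell to its neighbors."""
    return sorted({round(math.hypot(*to_xy(a, b)), 6) for a, b in offsets})


# Square grid: the 8 cells sharing an edge or a corner with the center cell.
square_offsets = [
    (dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
]
square = neighbor_distances(square_offsets, lambda dx, dy: (dx, dy))

# Hexagonal grid in axial coordinates (q, r): all 6 neighbors share an edge.
hex_offsets = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]
hexagon = neighbor_distances(
    hex_offsets, lambda q, r: (q + r / 2, r * math.sqrt(3) / 2)
)

print(square)   # two distances: edge neighbors and corner neighbors
print(hexagon)  # a single distance
```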

hammock · 2025-02-07T21:57:48 1738965468

Makes perfect sense. Thanks both

ajfriend · 2025-02-07T17:40:57 1738950057

It depends on whether you want to model a point or an area. lat/lng gives you a point, but you often want an area to, for example, count how many people are in that area. A spatial index like H3 provides a grid of area units.

HappMacDonald · 2025-02-07T18:37:13 1738953433

But so do lat long ranges.

ajfriend · 2025-02-07T21:08:33 1738962513

You can use those if they work for your application. One downside would be that you're storing 4 numbers compared to a single `int64` index with H3.

You also have to decide how you'll do that binning. Can bins overlap? What do you do at the poles? H3 provides some reasonable default choices for you, so you don't have to worry about that part of your solution design.
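To make those choices concrete, here is a minimal sketch of naive lat/lng binning. `latlng_bin` and its `size_deg` parameter are made up for this example and are not part of H3 or any library; the point is only to show the decisions you end up making yourself.

```python
import math
from collections import Counter


def latlng_bin(lat, lng, size_deg=1.0):
    """Hypothetical helper: map a point to a (row, col) bin of a lat/lng grid."""
    n_rows = math.ceil(180 / size_deg)
    n_cols = math.ceil(360 / size_deg)
    # Choice 1: half-open bins [lo, lo + size), so bins never overlap.
    row = math.floor((lat + 90) / size_deg)
    col = math.floor((lng + 180) / size_deg)
    # Choice 2: lat = +90 would fall off the grid, so fold it into the top row.
    row = min(row, n_rows - 1)
    # Choice 3: lng = +180 is the same meridian as -180, so wrap around.
    col = col % n_cols
    return row, col


points = [(37.77, -122.42), (37.80, -122.27), (90.0, 0.0), (0.0, 180.0)]
counts = Counter(latlng_bin(lat, lng) for lat, lng in points)
print(counts)
```

Even with these choices made, bins near the poles cover far less ground than bins at the equator, which is one more thing a fixed-degree grid leaves for you to handle.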

ajfriend · 2025-02-07T17:36:50 1738949810

...and use H3 instead! https://h3geo.org/

sbrother · 2025-02-07T17:43:34 1738950214

Very different use case -- ZIPs/ZCTAs have some semblance of population normalization

ajfriend · 2025-02-07T18:30:03 1738953003

If you care about that and have a data source, you can add, for example, population density per H3 cell as part of your analysis. That has the additional benefit of denoting this quantity of interest explicitly, rather than relying on some implicitly assumed correlation which may not be true.

ingenieroariel · 2025-02-07T18:51:49 1738954309

Hey AJ, this is almost on topic: do you know of a more up-to-date version of the dataset you used in the blog post for the H3 v4.0.0 release [1]? They stopped updating it in Oct 2023. Thanks! [1] https://data.humdata.org/dataset/kontur-population-dataset

ajfriend · 2025-02-07T21:10:39 1738962639

I don't. And maybe I should have emphasized "and have a data source" more, since it's doing a lot of the heavy lifting in my statement :)

mattforrest · 2025-02-07T18:21:44 1738952504

Not necessarily true. The population isn't balanced at all between many of them. Census units are.

ellisv · 2025-02-07T18:31:47 1738953107

Absolutely this. Use other Census areal units if you can and ZCTAs only if you have to.

diggan · 2025-02-07T19:18:28 1738955908

What H3 do I belong to if my house is split between three different ones, pretty much equally? Any/all of them?

maxmouchet · 2025-02-07T19:45:46 1738957546

You take a smaller H3 :-) The maximum area of a resolution 15 H3 is 1 square meter, so unlikely to split a house in two.

hammock · 2025-02-07T19:45:11 1738957511

What is the benefit of H3 over a rectangular grid?

ajfriend · 2025-02-06T00:20:48 1738801248

That map does seem to be using H3 hexagons: https://h3geo.org/

ajfriend · 2025-01-29T02:17:38 1738117058

Oh man, what an exciting opportunity. *clears throat* The Hacker News title seems to mistranslate the original em dash to an en dash.

ajfriend · 2024-12-02T02:51:36 1733107896

We use a submodule in https://github.com/uber/h3-py to wrap the core H3 library, which is written in C. Submodules seemed like a reasonable way to handle the dependency, and, at least for this use case, the approach hasn't given me any problems.

ajfriend · 2024-11-28T19:02:13 1732820533

You can compose SQL with https://ibis-project.org/tutorials/ibis-for-sql-users, which is using https://github.com/tobymao/sqlglot to parse the SQL under the hood.

As an alternative to parsing the SQL yourself, DuckDB's https://duckdb.org/docs/api/python/relational_api allows you to compose SQL expressions efficiently and lazily, which I've used when playing around with things like https://gist.github.com/ajfriend/eea0795546c7c44f1c24ab0560a...

ajfriend · 2024-11-28T17:58:21 1732816701

One approach I've been enjoying recently in my personal use is to write a light wrapper around DuckDB to enable composable SQL snippets. Essentially like what I have here https://gist.github.com/ajfriend/eea0795546c7c44f1c24ab0560a..., but without the `|` syntax.

You're still writing SQL, so you don't need to learn a new syntax, but I find it more ergonomic for quick data exploration. I also have an easier time writing SQL from memory than I do writing the equivalent Pandas code.

ajfriend · 2024-07-23T15:50:14 1721749814

This DuckDB blog post on range joins might be relevant: https://duckdb.org/2022/05/27/iejoin.html