ajfriend's comments

I have a project that's still very much at the experimental stage, where I try to get something similar to this pipe syntax by allowing users to chain "SQL snippets" together. That is, you can use standalone statements like `where col1 > 10` because the `select * from ...` is implied. https://ajfriend.github.io/duckboat/

    import duckboat as uck

    csv = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'

    uck.Table(csv).do(
        "where sex = 'female' ",
        'where year > 2008',
        'select *, cast(body_mass_g as double) as grams',
        'select species, island, avg(grams) as avg_grams group by 1,2',
        'select * replace (round(avg_grams, 1) as avg_grams)',
        'order by avg_grams',
    )
I still can't tell if it's too goofy, or if I really like it. :)

I write a lot of SQL anyway, so this approach is nice: I find I almost never need to look up function syntax like I would with Pandas, since it's just DuckDB SQL under the hood, minus the need to write `select * from ...` repeatedly. And when you're ready to exit the data exploration phase, it's easy to gradually translate things back to "real SQL".

The whole project is pretty small, essentially just a light wrapper around DuckDB to do this expression chaining and lazy evaluation.
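
To make the chaining idea concrete, here is a minimal sketch of how snippets could be nested into one query string. It is not duckboat's actual implementation: `chain` is a hypothetical helper, and it assumes DuckDB's FROM-first syntax, where `from tbl where ...` implies `select *`. Nothing is executed here; the function only builds the SQL text.

```python
def chain(source: str, *snippets: str) -> str:
    """Nest SQL snippets into one query string, one subquery per step.

    Hypothetical helper for illustration only. Assumes the target engine
    accepts DuckDB-style FROM-first queries such as `from tbl where x > 1`.
    """
    query = f"from {source}"
    for snippet in snippets:
        # Each step reads from the previous step's result as a subquery.
        query = f"from ({query}) {snippet.strip()}"
    return query


sql = chain("penguins", "where sex = 'female' ", "where year > 2008", "order by species")
print(sql)
```

Because each step only wraps the previous string, no work happens until the final query is handed to the database, which is one simple way to get lazy evaluation.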


They also only have one type of neighbor. Square grids have 2 neighbor types. Triangular grids have 3.


Makes perfect sense. Thanks both


It depends on whether you want to model a point or an area. lat/lng gives you a point, but you often want an area to, for example, count how many people are in that area. A spatial index like H3 provides a grid of area units.


But so do lat long ranges.


You can use those if they work for your application. One downside would be that you're storing 4 numbers compared to a single `int64` index with H3.

You also have to decide how you'll do that binning. Can bins overlap? What do you do at the poles? H3 provides some reasonable default choices for you so you don't have to worry about that part of your solution design.
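
As a rough illustration of the do-it-yourself alternative, here is a sketch of binning points into a fixed-size lat/lng grid and counting points per bin. The `grid_cell` function, the 1-degree cell size, and the sample points are all assumptions made for this example. It deliberately ignores the poles and the antimeridian, which are among the design decisions mentioned above.

```python
import math
from collections import Counter


def grid_cell(lat: float, lng: float, size_deg: float = 1.0) -> tuple[int, int]:
    """Map a point to the (row, col) of a non-overlapping lat/lng grid cell.

    Hypothetical helper: cells are size_deg by size_deg in degrees, so their
    real-world area shrinks toward the poles.
    """
    return (math.floor(lat / size_deg), math.floor(lng / size_deg))


# Made-up sample points as (lat, lng) pairs.
points = [(37.77, -122.42), (37.80, -122.27), (40.71, -74.01)]

# Count how many points fall in each grid cell.
counts = Counter(grid_cell(lat, lng) for lat, lng in points)
print(counts)
```

Each bin here is identified by two integers plus an implied cell size, whereas an H3 cell is identified by a single 64-bit index.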


...and use H3 instead! https://h3geo.org/


Very different use case -- ZIPs/ZCTAs have some semblance of population normalization


If you care about that and have a data source, you can add, for example, population density per H3 cell as part of your analysis. That has the additional benefit of denoting this quantity of interest explicitly, rather than relying on some implicitly assumed correlation which may not be true.


Hey AJ, this is almost on topic, do you know of a more up to date version of the dataset you used on the blog post release for H3 v4.0.0 [1]? They stopped updating in Oct 2023. Thanks! [1] https://data.humdata.org/dataset/kontur-population-dataset


I don't. And maybe I should have emphasized "and have a data source" more, since it's doing a lot of the heavy lifting in my statement :)


Not necessarily true. The population isn't balanced at all between many. Census units are.


Absolutely this. Use other Census areal units if you can and ZCTAs only if you have to.


What H3 do I belong to if my house is split between three different ones, pretty much equally? Any/all of them?


You take a smaller H3 :-) The maximum area of a resolution 15 H3 is 1 square meter, so unlikely to split a house in two.


What is the benefit of H3 over a rectangular grid?


That map does seem to be using H3 hexagons: https://h3geo.org/


Oh man, what an exciting opportunity. *clears throat* The Hacker News title seems to mistranslate the original em dash to an en dash.


We use a submodule in https://github.com/uber/h3-py to wrap the core H3 library, which is written in C. Submodules seemed like a reasonable way to handle the dependency, and, at least for this use case, the approach hasn't given me any problems.


You can compose SQL with Ibis (https://ibis-project.org/tutorials/ibis-for-sql-users), which uses SQLGlot (https://github.com/tobymao/sqlglot) to parse the SQL under the hood.

As an alternative to parsing the SQL yourself, DuckDB's relational API (https://duckdb.org/docs/api/python/relational_api) allows you to compose SQL expressions efficiently and lazily, which I've used when playing around with things like https://gist.github.com/ajfriend/eea0795546c7c44f1c24ab0560a...


One approach I've been enjoying recently in my personal use is to write a light wrapper around DuckDB to enable composable SQL snippets. Essentially like what I have here https://gist.github.com/ajfriend/eea0795546c7c44f1c24ab0560a..., but without the `|` syntax.

You're still writing SQL, so you don't need to learn a new syntax, but I find it more ergonomic for quick data exploration. I also have an easier time writing SQL from memory than I do writing the equivalent Pandas code.


This DuckDB blog post on range joins might be relevant: https://duckdb.org/2022/05/27/iejoin.html

