Geospatial data science with Julia (juliaearth.github.io)
159 points by juliohm on Oct 11, 2023 | 53 comments



I have a passion project, 4x4anarchy.com, that runs on a Python-MariaDB system for querying map data by latitude and longitude and transforming it into GeoJSON for map display. The website deals with sizable tables, approximately 1 GB in size. I've optimized extensively, relying on well-structured indexes, caching, and query tuning to keep performance up.

Given that, how might adopting Julia and a geospatial DB (PostGIS) further optimize geospatial data retrieval and presentation, especially when dealing with large datasets and intricate geospatial operations?


It would depend on where most of the processing is happening.

PostGIS gives you spatial indexes, which are extremely performant.

I've seen Python geospatial applications that took hours to finish processing drop to a few minutes when shifted onto PostGIS.

If you're also doing a lot of processing in Python, exploring other languages could help. With Julia you get a language with a rich type system that's also JIT-compiled to native code.
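
For a rough idea of the shape this takes from Julia, a minimal sketch via LibPQ.jl (the table/column names and connection string are made up):

    using LibPQ, Tables

    conn = LibPQ.Connection("host=localhost dbname=gis")  # hypothetical connection string

    # Radius query in meters; for the index to kick in here you'd want
    #   CREATE INDEX ON trails USING GIST ((geom::geography));
    sql = """
        SELECT id, name, ST_AsGeoJSON(geom) AS geojson
        FROM trails
        WHERE ST_DWithin(geom::geography,
                         ST_SetSRID(ST_MakePoint(\$1, \$2), 4326)::geography,
                         \$3)
    """
    rows = Tables.rowtable(execute(conn, sql, [-105.0, 39.5, 5_000.0]))
    close(conn)

PostGIS does the heavy lifting; Julia just ferries parameters and results.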


Geopandas has had spatial indexing available for quite a long time...

https://geopandas.org/en/stable/docs/reference/sindex.html

I think that the challenge for most is that the PostGIS query planner does the indexing for you in most queries, while a naive all-pairs comparison in geopandas/shapely won't tell you to use the .sindex attribute instead.


Interesting! I work on a very similar product.

I don't know Julia well, but I definitely would suggest exploring whether PostGIS can help improve the speed of your DB queries.

I'd also consider how you deliver your geospatial data to your clients -- I'm not sure GeoJSON is your best bet. Protobuf tiles might be better for your use-case (e.g. the Mapbox Vector Tiles spec).


I completely agree! It would be hard to overstate the power of PostGIS!

For anyone working with GIS data, it's absolutely worth investigating what PostGIS provides and how easily it integrates with your existing application!


I know you said it's a passion project, but you should probably still give the correct OSM attribution:

https://osmfoundation.org/wiki/Licence/Attribution_Guideline...


I appreciate you calling that out; I will get that done.


Cool site! Any chance of adding a simple KMZ export for offline use for a given area of interest?


Yeah, I can do that. Will get to it tomorrow!


Awesome - getting KMZs of 4x4 routes is way harder than it should be. All the Colorado data is there but extracting it is challenging.


If all you do is "find records within x miles of lat,lon", Solr/ES is the best solution. I think it can match a shape too.


Nice thing about Julia is that you randomly find cool projects like this.


Be mindful that most of Julia's geometry code is a wrapper around libGEOS (the C version) and libGDAL, which means you can't easily extend the algorithms; everything is a black box on the C side. Source: I worked in the field last year and have a small patch in LibGEOS.jl.


In other Julia geometry-related projects that may be true, but for this particular corner of the ecosystem the main author (Júlio Hoffimann) has (to the best of my understanding) implemented much of the underlying geometry and other code from scratch in pure Julia, in a whole set of packages, including e.g.:

https://github.com/JuliaGeometry/Meshes.jl
https://github.com/JuliaGeometry/Rotations.jl
https://github.com/JuliaEarth/GeoStatsBase.jl
https://github.com/JuliaEarth/PointPatterns.jl
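
For a quick taste of the pure-Julia side, a minimal sketch (exact constructors vary across Meshes.jl versions, so treat this as approximate):

    using Meshes

    t = Triangle((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
    area(t)               # 0.5, computed in pure Julia
    centroid(t)           # center of mass, also pure Julia
    Point(0.2, 0.2) in t  # true; point-in-polygon without libGEOS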


Exactly. It is a huge effort. Thanks for pointing it out @cbkeller.


This is not true. Please read the book.


Scanning the site, I see mostly point algorithms; the only mention of polygons is a textbook LibGEOS call, and I see no networks at all. I see no smart manipulation of anything other than points, no subdivision of space, etc.


You probably need to re-scan the book. Meshes.jl is the submodule of the project, written entirely in Julia, with the geometric processing algorithms.


I have worked with it. It was just starting then, with very little useful code in it. Going back to the source code, I see they've added a bit more. A quick look around suggests that only one algorithm uses an indexing structure. Clipping seems limited to a convex polygon against a concave one.


The book is quite interesting, but it does seem like a lot of the underlying work is farmed out to GeoStats.jl, which doesn't really use the vocabulary I'd expect from PostGIS or Geopandas, etc. For example, I don't see many mentions of Polygons or MultiPolygons when I search. However, I do find this page[1], which seems to define similar(?) equivalents (my rough guess at the mapping is sketched below). Can I expect equivalent geospatial joins/queries to be available? I don't see many mentions of the operations I would normally do, especially overlay operations[2].

[1]: https://juliaearth.github.io/GeoStatsDocs/stable/domains.htm...

[2]: https://geopandas.org/en/stable/docs/user_guide/set_operatio...
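
For the curious, here's my rough guess at the vocabulary mapping as a minimal sketch (the constructors have changed across Meshes.jl versions, so treat them as approximate):

    using Meshes

    # My unofficial mapping: PolyArea ~ Polygon, Multi ~ MultiPolygon
    square = PolyArea((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0))
    tri    = PolyArea((2.0, 2.0), (3.0, 2.0), (2.0, 3.0))
    mp     = Multi([square, tri])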


Please read the book. It has all the information in it.


In the preface you list:

- Generate high-performance code

- Specialize on multiple arguments

- Evaluate code interactively

- Exploit parallel hardware

> This list of requirements eliminates Python, R and other mainstream languages used for data science.

Can you elaborate on why/how? Awesome work, by the way.


Python and R do not generate high-performing code. At best they generate calls to high-performing code.


> At best they generate calls to high-performing code.

It should be noted that this is usually sufficient. But particularly for Earth-scale problems, it often isn't.


Julia is designed to seem to win arguments, as best I can tell... If you complain about the need to break abstractions or the lack of general-purpose applications, you're accused of not understanding. When you say it's slow, they say you can inline assembler; when you say that's dumb (why have a high-level language then?), they say you don't have to, it's fast as-is and everyone else is slow. It just devolves into circular arguments. Abstractions exist in layers for reasons.


You can obviously provide the same abstraction with different implementations that yield different performance characteristics. Julia provides the same level of flexibility (if not more) as Python without any of the design decisions which cause Python to be so slow. I fail to see how this is a contentious point.


Yeah this is a great example of what I'm saying. You obviously don't understand Python then.


When you say Julia is slow, what are you talking about? Even without any fancy tricks, normal Julia code is usually the same speed as the equivalent normal C code.


Yes, this is another great example... If Julia is fast, then why do you need to inline assembler?


For the same reason that C/C++ allow inline assembly? Languages come in roughly three speeds: slow (e.g. Python/R), mostly not slow (e.g. Java/Go), and not slow (e.g. C/Rust). If you want actually fast code (e.g. the speed of BLAS/FFTW etc.), you need the combination of a not-slow language, code generation, and often hand-coded assembly for the most performance-critical parts.


I noticed you didn't mention Julia explicitly this time, because when you outline the abstractions like this, it seems silly to claim that something about Julia magically solves the position and purpose of these layers. I could write a PySpark job from a tutorial that would run circles around a single-core Julia process designed with contradictory requirements. I just don't see how Julia gets away with claiming it solves all of this on the first page of its documentation without a ton of qualifiers... except that's what Julia does: bold claims that obfuscate what performance is and where it comes from.


To be explicit about where Julia fits in here: Julia is a "not slow" language (you could argue it's on the faster end of "mostly not slow" due to GC) that also has enough high-level features (higher-order functions, macros, memory management, general ease of use) to work as a high-level language. You absolutely can write a distributed Python codebase that runs faster than single-core Julia, but doing so will likely be harder than writing the distributed/threaded Julia code that is way faster than PySpark.
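
For a flavor of what I mean by threaded Julia code, a toy sketch (not a benchmark; the function and names are mine):

    using Base.Threads

    # Toy threaded reduction: one chunk of indices per thread, one spawned task per chunk.
    function tsum(f, xs)
        n = max(1, nthreads())
        chunks = Iterators.partition(eachindex(xs), cld(length(xs), n))
        tasks = [Threads.@spawn sum(f, view(xs, c)) for c in chunks]
        return sum(fetch.(tasks))
    end

    tsum(x -> exp(-x^2), rand(10^7))  # start Julia with `julia -t auto` to use all cores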


Yeah, citation needed on that one, but it was a dumb hypothetical on my part to illustrate the problem. Another hypothetical: how do you get junior people to support your inline ASM? If it makes this easy to do, it makes the technical debt that much more rampant.


You seem really hung up on this inline asm thing. It's not like most Julia code is just inline assembly. At a rough estimate, 0.5% of packages use it to squeeze out the final drops of performance, and it then gets wrapped in an API that looks like normal Julia code. This isn't any different from C/C++, where some low-level codebases also use compiler directives or inline assembly.


The way it's brought up in arguments (if what you're doing in Julia is slow, you can put in the ASM directly) leaves me with the impression that it is nonetheless a core part of the "faster than <x>" claims. And that's a cop-out.


If I remember correctly, this came up originally in the context of comparing heavily optimized Julia code to C code that had inline assembly, where a statement was made that Julia was obviously slower than C because the C code had hand-written assembly in it. Julia, like C, sometimes needs hand-coded assembly to achieve maximal speed. 80% of a fast programming language is not having semantics that are fundamentally opposed to speed (i.e. object-oriented architectures that require pointer chasing, arbitrary-precision numbers everywhere, or eval semantics that force interpreting rather than compiling code). Languages that avoid those kinds of mistakes are "not slow": if you write similar code in them, you will end up with similar performance to C.


Yeah, I guess that illustrates how everyone is talking about different things. I just think Julia shouldn't use that to claim it's not as slow as many other things, or make other vague promises of high-level abilities that somehow traverse paradigms to become the right thing to do at the low level. The fact that you've similarly called out specific caveats further illustrates how, in my opinion, it is simply intellectually dishonest how the Julia documentation categorizes others, and how promoters of the language don't really know what it means to say Julia is fast; it clearly can be made not to be.


You realize that the specific things I called out are things Julia doesn't do, right? Those are the things in Java, Python, and most other high-level languages that prevent them from being fast. Julia's semantics specifically avoid them.


Honestly, I don't know why you're replying; it doesn't really address what I'm talking about and uses a lot of the argument styles I'm complaining about.


I used to think so, but I have a function that gets called about a billion times every day as new data comes in, and it takes about 0.01 seconds to evaluate (optimization with nlopt). I tried coding it in C (30% speed improvement), Python (twice as slow), and Julia (about the same speed as R). The reason is that the call has 5 parameters that operate on a vector of length 50 to return a value to minimize. Turns out R is pretty good at such vector calculations.


Is this what you mean by nlopt? https://github.com/stevengj/nlopt

If so, it looks like you're interfacing from R to high-performing code written in C. Isn't that exactly what OP was describing?


No, the function it calls is pure R, and that is where the code spends all its time.


Interesting. Have you published the code and/or benchmarks anywhere? This flies in the face of everything I've read about Julia and R.


No, it's 4 lines of code. I just benchmarked for myself. All it does, for 2 vectors x and y of average length 50 and 5 parameters, is exponentials, additions, and multiplications, with an ultimate sum to return to the optimizer. I was also surprised, as I expected C to be much faster. And with Rcpp, it's actually slower than the R version; overhead, I guess. When I looked into it, apparently R has really fast code for such vector calculations. With Julia, admittedly I did not use SIMD, which would likely make it faster.
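
For the curious, the shape is roughly this in Julia (hypothetical parameter names and formula, not my actual code):

    # Hypothetical kernel: 5 parameters p, data vectors x and y (~50 elements);
    # exponentials, multiplications, and additions reduced to a scalar for the optimizer.
    f(p, x, y) = sum(@. (y - p[1] * exp(-p[2] * x) - p[3] * exp(-p[4] * x) - p[5])^2)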

Now, I generally use Julia for heavy computes, and usually it's much faster than R. But not always.

And this little bit of code runs for hours on the largest instance on AWS every day, which is why I was looking to speed it up.


If possible, I would encourage you to make an MWE and post it on the Julia Discourse ("Julia slower than R for optimization problem" or something like that); there's a good chance the community will be able to eke out some more performance. Alternatively, you might have hit a case in which Julia itself is currently leaving performance on the table, which would still be helpful for the community to know, as being slow is often considered a bug in the Julia world.


Great idea; I do lurk there and just might do that next time I look at the code. My hunch is that SIMD is the low-hanging fruit Julia brings. But (and I am not an expert) both might end up calling BLAS anyway, which is why they are so similar.
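
(For anyone reading along, the loop form of such a kernel that gives SIMD a chance would look roughly like this; same hypothetical formula as above:)

    function f_loop(p, x, y)
        s = 0.0
        # @inbounds skips bounds checks; @simd permits reassociating the reduction
        @inbounds @simd for i in eachindex(x, y)
            r = y[i] - p[1] * exp(-p[2] * x[i]) - p[3] * exp(-p[4] * x[i]) - p[5]
            s += r * r
        end
        return s
    end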


I am willing to concede, but also willing to argue that vector and data frame manipulations in R are calls to optimized code.

Like, R is "what if we made a Lisp-inspired version of Python built around numpy and pandas, and then sent it back in time".


I think that is exactly what is happening. Most of my code is much, much faster in Julia, and the code is nicer. But R has its moments. Which is good, since this particular app has 3K lines and I do not want to port it to Julia.

And data.table in R is faster (and, I think, nicer to write) than DataFrames.jl in Julia. And since data.tables feed my optimization, R still wins.


R can exploit parallel hardware just fine with parallel, future, and other libraries like mirai. The problem is that execution speed is going to be a bottleneck for anything large, and once you exhaust the easy optimizations, maybe R is not the best language for the job. But it depends a lot on the use case.


I much prefer parallelism in R with mclapply() to Julia's parallel implementation. It's one of the few areas where I prefer R to Julia (the other being R's data.table over Julia's DataFrames.jl).


Geospatial Data Science with Julia presents a fresh approach to data science with geospatial data and the Julia programming language. It contains best practices for writing clean, readable and performant code in geoscientific applications involving sophisticated representations of the (sub)surface of the Earth such as unstructured meshes made of 2D and 3D geometries.


Are you a bot? Why did you copy and paste the top paragraph of the linked page?


Seems to be the author, who copied it as some form of abstract. @juliohm, no need to do that.



