
The union-find data structure / algorithm is useful and a lot of fun.

The goal is a data structure where you can perform operations like "a and b are in the same set", "b and c are in the same set" and then get answers to questions like "are a and c in the same set?" (yes, in this example.)

The implementation starts out pretty obvious - a tree where every element either points at itself or at something it was merged with. To check whether two elements are in the same set, check whether they have the same root. At first glance, it sounds like findRoot() will average O(log(n)), with a worst case of O(n).

There are a couple of simple optimizations you can do to this structure, the kind of thing that seems like it shouldn't affect the asymptotic runtime all that much. The first is that, whenever you find a root, you can re-parent all the nodes you visited on the way to that root, so they'll all be quicker to look up next time. The second is to keep track of the size of each set, and always make the larger set the parent of the smaller.

Neither of those actually does anything impressive alone, but if you use both, the algorithm suddenly becomes incredibly fast, with the slowest-growing (non-constant) complexity I've ever heard of: O(α(n)), where α is the inverse of the Ackermann function. Or, for any reasonable n, O(4 or less).
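A minimal sketch of the structure described above, with both optimizations (all names are illustrative):

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.size = [1] * n           # size of the set rooted at each element

    def find(self, x):
        # Chase parent pointers to the root, then re-parent every node
        # on the path directly to the root (path compression).
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by size: hang the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def same_set(self, a, b):
        return self.find(a) == self.find(b)

uf = UnionFind(4)
uf.union(0, 1)  # "a and b are in the same set"
uf.union(1, 2)  # "b and c are in the same set"
print(uf.same_set(0, 2))  # True
```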




Agreed, union-find is great.

My favourite usage of it in practice is for solving first-order (syntactic) unification problems. Destructive unification by means of rewriting unbound variables to be links to other elements is a very pretty solution - especially when you consider the earlier solutions such as that of Robinson's unification which, when applied, often involves somewhat error-prone compositions of substitutions (and is much slower).

The fact that the forest of disjoint sets can be encoded in the program values you're manipulating is great (as opposed to, say, the array-based encoding often taught first): makes it very natural to union two elements, chase up the tree to the set representative, and only requires minor adjustment(s) to the representation of program values.

Destructive unification by means of union-find for solving equality constraints (by rewriting unbound variables into links and, thus, implicitly rewriting terms - without explicit substitution) makes for very easy little unification algorithms that require minimal extension to the datatypes you intend to use. Removing the evidence of the union-find artefacts can then be achieved in the most natural way: replace each link with its representative by using find(l) (a process known as "zonking" in the context of type inference algorithms).

Basic demonstration of what I'm talking about (the incremental refinement of the inferred types is just basic union-find usage): https://streamable.com/1jyzg8
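For the curious, here's a toy version of that destructive scheme (the Term representation and all names are illustrative, not from any particular system; the occurs check is omitted for brevity):

```python
class Var:
    """An unbound type variable, or a link to whatever it was unified with."""
    def __init__(self, name):
        self.name = name
        self.link = None  # None = unbound; otherwise points at a term

class App:
    """A type constructor applied to arguments, e.g. App('->', [a, b])."""
    def __init__(self, ctor, args):
        self.ctor, self.args = ctor, args

def find(t):
    # Chase variable links up to the representative term.
    while isinstance(t, Var) and t.link is not None:
        t = t.link
    return t

def unify(t1, t2):
    t1, t2 = find(t1), find(t2)
    if t1 is t2:
        return
    if isinstance(t1, Var):
        t1.link = t2  # "union": rewrite the variable into a link
    elif isinstance(t2, Var):
        t2.link = t1
    elif t1.ctor == t2.ctor and len(t1.args) == len(t2.args):
        for a, b in zip(t1.args, t2.args):
            unify(a, b)
    else:
        raise TypeError(f"cannot unify {t1.ctor} with {t2.ctor}")

def zonk(t):
    # "Zonking": replace every link with its representative,
    # erasing the union-find artefacts from the final term.
    t = find(t)
    if isinstance(t, App):
        return App(t.ctor, [zonk(a) for a in t.args])
    return t

a, b = Var("a"), Var("b")
unify(App("->", [a, App("int", [])]), App("->", [App("bool", []), b]))
# a is now linked to bool, and b to int
```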


I remember Ravelin shared their story of using it in card fraud detection code.

Basically they managed to replace a full-fledged graph database with a small piece of code using union-find in Go. It was a great talk.

https://skillsmatter.com/skillscasts/8355-london-go-usergrou...


My writeup of union find is in https://dercuano.github.io/notes/incremental-union-find.html. I did not in fact find a way to make it efficiently support incremental edge deletion, which is what I was looking for.


> I did not in fact find a way to make it efficiently support incremental edge deletion, which is what I was looking for.

I don't understand this goal. The interior connections aren't relevant to a union-find structure; ideally you have a bunch of trees of depth 2 (assuming the root is at depth 1), but the internal structure could be anything. That fact immediately means that the consequence of removing an edge is not well-defined - it would separate the two objects joined by the edge, and it would also randomly divide the original connected set into two connected sets, each containing one of those two objects. Which object each "third party" object ended up being associated with would be an artifact of the interior structure of the union-find, which is not known and is constantly subject to change as you use the union-find.

If all you want is to be able to remove a single object from a connected set, on the assumption that your union-find always has the ideal flat structure, that's very easy to do - call find on the object you want to remove, which will link it directly to the root of its structure, and then erase that link.
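A dict-based sketch of that removal, under the stated flat-structure assumption (nothing else points at the object being removed; names are illustrative):

```python
parent = {"a": "r", "b": "r", "r": "r"}  # a flat set {a, b, r} rooted at r

def find(x):
    root = x
    while parent[root] != root:
        root = parent[root]
    parent[x] = root  # compression: x now points directly at its root
    return root

def remove(x):
    # Only valid under the flat-structure assumption: if anything still
    # pointed at x, erasing the link would drag those children along.
    root = find(x)  # links x directly to the root of its structure
    if root == x:
        raise ValueError("cannot detach a root this way")
    parent[x] = x   # erase the link; x is now a singleton set

remove("a")
print(find("a") == find("b"))  # False: a was split off
```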


Union find takes as input some undirected graph <N, E> and internally constructs (and progressively mutates) a directed graph <N, E'> which it uses to efficiently answer queries about whether two nodes n₁, n₂ ∈ N are in the same connected component of <N, E>. It additionally supports incrementally adding edges to E.

My quest was to find a way to incrementally delete edges from E, not E'. You're talking about deleting edges from E', which I agree is not generally a useful thing to do.

For example, your N might be the cells of a maze map, and your E might be the connections between adjacent cells that are not separated by a wall. In that case you can tear down a wall and add a corresponding edge to E. But it would be nice in some cases to rebuild a wall, which might separate two previously connected parts of the maze. I was looking for an efficient way to handle that, but I didn't find one.


I don't think there's a way to extend union-find to do what you want in the general case. You might have more luck starting with a minimum cut algorithm.

For the specific case where you can identify some of your edges are more or less likely to be deleted, though, you can run union-find on only the stable edges, cache that result, and then do the unstable edges. Whenever an unstable edge is deleted, reload the cache and redo all the unstable edges. This works for something like a maze with open/closable doors.
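One way to sketch that caching scheme (structure and names are my own, assuming the union-find copies cheaply):

```python
import copy

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def connectivity(n, stable_edges):
    # Run union-find on the stable edges once and cache the result.
    base = UnionFind(n)
    for a, b in stable_edges:
        base.union(a, b)

    def rebuild(current_unstable):
        # On any deletion, restart from the cached stable structure
        # and replay only the surviving unstable edges.
        uf = copy.deepcopy(base)
        for a, b in current_unstable:
            uf.union(a, b)
        return uf

    return rebuild

# A "maze" where the (1, 2) connection is a door that can open or close.
rebuild = connectivity(4, stable_edges=[(0, 1), (2, 3)])
uf = rebuild([(1, 2)])           # door open: everything connected
print(uf.find(0) == uf.find(3))  # True
uf = rebuild([])                 # door closed: components split again
print(uf.find(0) == uf.find(3))  # False
```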


Those are some interesting ideas, thanks! I hadn't thought about trying to apply minimum-cut to the problem!


> Union find takes as input some undirected graph <N, E> and internally constructs (and progressively mutates) a directed graph <N, E'> which it uses to efficiently answer queries about whether two nodes n₁, n₂ ∈ N are in the same connected component of <N, E>.

I don't think this is a valuable way to think about the structure. That's not what it's for.

A union-find is a direct representation of the mathematical concept of an equivalence relation. The input is a bunch of statements that two things are equal. It will then let you query whether any two things are or aren't equal to each other.

This is captured well by the more formalized name "disjoint-set structure". You have a set of objects. You can easily remove an element from the set. But it makes no conceptual sense to try to "separate two of the elements in a set". They're not connected, other than conceptually through the fact that they are both members of the same set.

A union-find is a good tool for answering whether two vertices are in the same connected component of a graph because being in the same connected component of a graph is an equivalence relation, not because the union-find is a graph-related concept. It isn't, and there is no coherent way to apply graph-theoretical concepts to it.

Putting things another way:

You describe a union-find as something used to "efficiently answer queries about whether two nodes n₁, n₂ ∈ N are in the same connected component of [a graph]".

But you focused on the wrong part of that statement. What makes a union-find appropriate for that problem is the words "the same", not the words "connected component".


You are welcome to think about the union-find structure however you prefer to think about it, but I was describing the problem I was trying to solve, for which the correct description of union find I gave above is optimal. Contrary to your mistaken assertion, no contradictions arise from applying graph-theoretical concepts in this way; there is no problem of "coherency". It's just a form of description you aren't accustomed to.

If your way of thinking about union find makes it hard for you to understand the problem I was trying to solve, maybe it isn't the best way to think about it for the purpose of this conversation, even if it is the best way to think about it in some other context.

I'm not claiming my description is the only correct description of union find, just that it's the one that most closely maps to the problem I was trying to solve.

The description can equally well be formulated in the relational terms you prefer: given a relation R, union find can efficiently tell you whether, for any given x and y, (x, y) ∈ R', where R' is the symmetric transitive reflexive closure of R (and is thus the smallest equivalence relation containing R). It efficiently supports incrementally extending R by adding new pairs (a, b) to it.

This is equivalent to my graph-theoretic explanation above except that it omits any mention of E'. However, it is longer, and the relationship to the maze-construction application is less clear. Perhaps you nevertheless find the description clearer.

What I was looking for, in these terms, is a variant of union find that also efficiently supports incrementally removing pairs from R.

Does that help you understand the problem better?

— ⁂ —

In general there is a very close correspondence between binary relations and digraphs, so it's usually easy to reformulate a statement about relations as an equivalent statement about digraphs, and vice versa. But one formulation or the other may be more perspicuous.

More generally still, as you move beyond the most elementary mathematics, you will learn that it's usually a mistake to treat one axiom system or vocabulary as more fundamental than another equivalent axiom system, since you can derive either of them from the other one. Instead of arguing about which one is the right axiom system or vocabulary, it's more productive to learn to see things from both viewpoints, flexibly, because things that are obvious from one viewpoint are sometimes subtle from the other one.


> You are welcome to think about the union-find structure however you prefer to think about it, but I was describing the problem I was trying to solve, for which the correct description of union find I gave above is optimal.

> If your way of thinking about union find makes it hard for you to understand the problem I was trying to solve, maybe it isn't the best way to think about it for the purpose of this conversation, even if it is the best way to think about it in some other context.

I'm not having any problems understanding the problem you were trying to solve. My problems are in the area of understanding why you thought a union-find might be an effective way to address that problem.

I'm telling you that, if you adjust your view of what a union-find is, you will have a better conception of where it can and can't be usefully applied.

> In general there is a very close correspondence between binary relations and digraphs, so it's usually easy to reformulate a statement about relations as an equivalent statement about digraphs, and vice versa. But one formulation or the other may be more perspicuous.

In this case, the relation is symmetric by definition, so you just have graphs. Yes, it's obvious how you can use a graph to describe a relation.

But the entire point of the union-find structure is to discard any information contained in an equivalent graph that isn't contained in the relation. (And there are many equivalent graphs, but only one relation!) The absence of that information is what makes the structure valuable. It shouldn't be a surprise that, if you want to preserve information from a particular graph, you are better served with graph-based algorithms.


Well, that's an interesting way to look at it; I see better where you're coming from now.

But I wasn't trying to get the unmodified data structure to handle deletion efficiently; I was trying to use it as a basis to design a structure that could. I thought I saw a minor modification that achieved this, but then I found out why it doesn't work. That's what my linked note above is about.

More generally, if you only try things that will probably work, you'll never achieve anything interesting.


Can you give some examples where union-find is applied to great benefit?


A while back I had to perform an entity resolution task on several million entities.

We arrived at a point in the algorithm where we had a long list of connected components by ids that needed to be reduced into a mutually independent set of components, e.g. ({112, 531, 25}, {25, 238, 39, 901}, {43, 111}, ...)

After much head banging about working out a way to do this that wouldn't lead to an out-of-memory error, we found the Disjoint Set Forest data structure and our prayers were answered.

I was very grateful for it!


Exactly the same for me. We had a defect management system, and a defect can be connected to others as dupes. We'd have whole chains of dupes. We used union find to identify them.


One example is connected component analysis on images. In one linear scan of an image (2d or 3d) you can find all connected blobs in it.


I used it to find and track islands of connected polygons in 3D meshes.


A textbook example of an algorithm that works very well with union-find is Kruskal's algorithm for finding a minimum-weight spanning tree of a graph. Using union-find improves the time complexity of the algorithm from O(V²) to O(E log V).

This happens because Kruskal's algorithm essentially selects the cheapest edge not already included in the spanning tree that won't cause a cycle. Union-find speeds up that potential-cycle check, which would otherwise be naively quadratic.
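A compact sketch of Kruskal's algorithm with that union-find cycle check (a minimal version, without union by size):

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v) tuples. Returns the MST edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):  # cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:               # equal roots would mean a cycle
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
print(kruskal(4, edges))  # [(1, 0, 1), (2, 1, 2), (4, 2, 3)]
```

The (3, 0, 2) edge is skipped because 0 and 2 are already in the same component by the time it is considered.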


Oh, huh, this is serendipitously handy because I need to be doing something like that this weekend (translating from Go to Rust - I already have a reparenting solution using hashes but hopefully this is easier / better.)


I have always wanted to really understand this data structure. Sure, I can follow the analysis with the potential functions and all, but I never really understood how Tarjan came up with the functions in the first place. Does anybody have a resource which intuitively explains the analysis?


This might be a long read, but it is a well-written, reader-friendly analysis of Tarjan's proof with its chain of reasoning. https://codeforces.com/blog/entry/98275


Robert Sedgewick has a great explanation in his Algorithms book. He has gorgeous diagrams alongside his explanations.


Is it in an old edition? I just downloaded an extremely legitimate version of Algorithms, 4th edition, but it has no section on the union-find data structure.


It should be in Chapter 1 Section 5 of the 4th edition. It's even in the table of contents.


Sorry, my bad. I searched for the word "disjoint" expecting it to be prominent in this section, but somehow that word appears only 3 times in the book, none of which are in this section!

Anyway, while the figures in this work were indeed gorgeous, it was not what I was looking for. It did not even contain a non-intuitive analysis of the inverse Ackermann complexity, much less an intuitive one!


Here's a writeup I made a while ago: https://www.overleaf.com/read/hjbshsqmjwpg


We use it to wire different outputs and inputs to gates in a zero-knowledge proof circuit!


I have heard stories about annotating the edges with elements from a group rather than using unlabelled edges. Have you heard anything about that?


You can efficiently maintain count/max/sum/hash/etc. for each component. It's occasionally useful.
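A sketch of that idea (names illustrative), folding the aggregate into the new root on each union - sum here, but count/max/hash work the same way:

```python
class UnionFind:
    def __init__(self, values):
        self.parent = list(range(len(values)))
        self.agg = list(values)  # per-root aggregate, here a running sum

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            self.agg[rb] += self.agg[ra]  # fold into the surviving root

    def component_sum(self, x):
        return self.agg[self.find(x)]

uf = UnionFind([5, 7, 11])
uf.union(0, 1)
print(uf.component_sum(0))  # 12
```

Only the value stored at the root is meaningful; stale values at non-root nodes are simply never consulted.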


I wrote a beginner's guide on this topic a while ago https://medium.com/nybles/efficient-operations-on-disjoint-s...


This sounds super useful, thanks! I think I'll have something this can be used for: taking individual claims of whether two elements are in the same set, then easily getting the largest set of equal items.


If at some point you have a cycle, you can then reparent and remove the cycle, right? This structure in general then can encode transitive relations effectively?

Something like “a and b …” “b and c …” “c and a …”.


> If at some point you have a cycle, you can then reparent and remove the cycle, right?

You'd never have a cycle. The way a cycle would theoretically arise would have to be joining something to its own child. But you don't naively join the two objects you're given - you find their root objects, and join those. If - as would be necessary for a cycle to form - they have the same root, you can return without doing anything.

If we call

    unite(a,b)
    unite(b,c)
    unite(c,a)
then we should end up with this structure:

        c
       / \
      b   a
With parents on the right, the structure will be a-b after the first call and a-b-c after the second call. The reason we end up with a shallower tree after the third call is that when a is passed to unite, we call find on it and it gets reparented directly to c. Then, c (the representative for c) and c (the representative for a) are already related, so we don't adjust the structure further.
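That trace can be sketched directly (dict-based, with the parent links stored explicitly so the shape is visible):

```python
parent = {x: x for x in "abc"}

def find(x):
    root = x
    while parent[root] != root:
        root = parent[root]
    parent[x] = root  # path compression: x now points straight at the root
    return root

def unite(x, y):
    rx, ry = find(x), find(y)
    if rx != ry:      # equal roots: already related, so no cycle can form
        parent[rx] = ry

unite("a", "b")  # a -> b
unite("b", "c")  # b -> c
unite("c", "a")  # find(a) reparents a directly to c; roots equal, no new link
print(parent)    # {'a': 'c', 'b': 'c', 'c': 'c'}
```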

> This structure in general then can encode transitive relations effectively?

Well... no. This structure embodies the concept of an equivalence relation. You wouldn't be able to use it to express a general transitive relation, only a transitive relation that is also symmetric and reflexive.


Thanks!


Is there a name for the data structure/algorithm?


It's called union-find, but also sometimes disjoint-set. The process of re-parenting nodes is called path compression, and the optimization of making the larger of the two sets the parent when merging is called union by size (a closely related variant, union by rank, uses tree depth instead). Any of those terms should give you more details upon search.



I remember getting this in an interview without ever coming across it.. basically, fuck union find.


Wouldn't that be extremely useful and a polynomial solution for 3-SAT, which is NP complete?


I think I understand why you're confused here. To the extent that union-find is similar to SAT, it's similar to 2-SAT: the constraints you're reasoning about are defined only on pairs of elements, whereas 3-SAT fundamentally reasons about triplets (at least one of these three variables is true). Notably, 2-SAT, unlike 3-SAT, is solvable in polynomial time.


SAT's complexity comes from blowing up a reasonable number of variables/clauses in the formula (which are very rarely equivalent to one another) into an enormous number of variable assignments - too many to store, and falling into only three classes (formula true, formula false, or not yet evaluated) that are not useful equivalences. What disjoint sets are you thinking of tracking?


How is it a solution to 3-SAT?




