Unless I really didn't want to introduce dependencies, or needed to keep code size down, I think I'd use an off-the-shelf hash table implementation these days. It's still a fun exercise to build your own, though.
One important consideration is missing from this benchmark: how they behave with multi-threading.
There can be big differences between different hash table implementations when you have to use them from multiple threads (built-in smart thread safety vs. external dumb locks, etc.).
That has nothing to do with this. Granular lock-free tables will outperform global locking by orders of magnitude, but they have to be designed differently from the ground up.
I really like the API design of libjudy for general hash table like interfaces. It prevents you from having to hash a key twice just to do "check if key is not present, then set value, otherwise leave original value in hash." The pattern also meshes better with the way iterators naturally work.
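For illustration, here's a toy fixed-size table showing that pointer-to-value-slot shape (my names, not libjudy's actual API, though JudyLIns works in this spirit). One lookup does both the "is it there?" check and the insert:

#include <stdio.h>
#include <string.h>

#define N 64   // toy: fixed size, never fills up in this example
struct slot { const char *key; void *val; };
static struct slot table[N];

// Return a pointer to the value slot for key, claiming an empty slot if
// the key isn't present yet. The caller checks *p == NULL for "was absent".
void **table_slot(const char *key) {
    size_t h = 0;
    for (const char *p = key; *p; p++) h = h * 31 + (unsigned char)*p;
    for (size_t i = h % N; ; i = (i + 1) % N) {
        if (table[i].key == NULL) { table[i].key = key; return &table[i].val; }
        if (strcmp(table[i].key, key) == 0) return &table[i].val;
    }
}

int main(void) {
    void **v = table_slot("answer");
    if (*v == NULL) *v = "forty-two";   // set only if not already present
    printf("%s\n", (char *)*v);
    return 0;
}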
Also, in terms of this table, if you add the computed key hash value to the stored hash entry, you can check that the value of the hashes match before you do the full strcmp. If you have a weak hash you might also get a benefit from checking that the first characters match before calling the full strcmp.
It would also make rehashing easier, since you'd already have the hash on hand and wouldn't need to push every entry back through your internal set function to move it into the new table. In the posted implementation, a rehash pays the worst-case cost because every key gets hashed again from scratch.
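A rough sketch of both ideas (the struct and function names are mine, not from the article): keep the computed hash in each entry, compare hashes before strcmp, and place entries directly by the stored hash during a resize:

#include <string.h>
#include <stdint.h>
#include <stddef.h>

struct entry {
    const char *key;
    void *value;
    uint64_t hash;     // full hash kept alongside the entry
};

// Lookup: cheap integer compare first, strcmp only on a hash match.
static int entry_matches(const struct entry *e, const char *key, uint64_t h) {
    return e->hash == h && strcmp(e->key, key) == 0;
}

// Resize: place each old entry directly by its stored hash, with no
// re-hashing and no call back into the public set function.
// new_entries is assumed calloc'd (all keys NULL), new_cap a power of two.
static void move_entry(struct entry *new_entries, size_t new_cap,
                       const struct entry *old) {
    size_t i = (size_t)(old->hash & (new_cap - 1));
    while (new_entries[i].key != NULL)
        i = (i + 1) & (new_cap - 1);   // linear probing
    new_entries[i] = *old;
}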
Anyways.. "man hsearch.3" if you want something cheap and easy.
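For reference, roughly what that looks like (POSIX <search.h>, error handling omitted):

#include <search.h>
#include <stdio.h>

int main(void) {
    hcreate(64);                          // capacity is fixed up front
    ENTRY e = { .key = "apple", .data = "red" };
    hsearch(e, ENTER);                    // insert
    ENTRY *found = hsearch((ENTRY){ .key = "apple" }, FIND);
    if (found)
        printf("%s -> %s\n", found->key, (char *)found->data);
    hdestroy();
    return 0;
}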
POSIX hsearch tables are terrible. There's a reason why nobody uses them.
One at a time per program, having to know how big the table needs to be at initialization time with no automatic resizing, an API that's very limited (no way to remove entries, iterate over entries, tell how many entries are in the table, ...)
Reentrant versions have existed for quite some time.
> having to know how big the table needs to be at initialization time with no automatic resizing
The posted implementation resizes by creating an entirely new table and then moving all entries. The exact same mechanism is available with hsearch.
> no way to remove entries, iterate over entries
Strong downside, but the posted implementation has no removal function either. It also leaks key memory on resize and free, and its iterator would need modification to allow deletion during iteration.
>Reentrant versions have existed for quite some time.
Not in POSIX. glibc has one, but that doesn't do much good if you want your code to work portably. I do see they've made their way into FreeBSD and NetBSD, but not OpenBSD. Are they available on macOS? Any of the surviving commercial Unixes?
>> having to know how big the table needs to be at initialization time with no automatic resizing
>The posted implementation resizes by creating an entirely new table and then moving all entries. The exact same mechanism is available with hsearch.
That's how resizing a hash table typically works, yes. Except it's normally done automatically and transparently by the hash table code. hsearch doesn't do that. And since it doesn't give you a way to iterate over values in the table, you'd have to keep a separate list of everything you add to it in order to manually delete the existing table and make a bigger one and re-add everything to it... Like I said, it's terrible.
I was noodling in this area recently, trying to speed up some code similar to the tr utility:
$ echo abcdef |tr abc ghi
ghidef
For an eight-bit character set, I found that building an array to map every character improved on linear search, even for short replacement strings and relatively short input strings.
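Something along these lines (illustrative names):

#include <stdio.h>
#include <stddef.h>

// Build a 256-entry map once, then translation is one lookup per byte.
static void build_map(unsigned char map[256], const char *from, const char *to) {
    for (int i = 0; i < 256; i++)
        map[i] = (unsigned char)i;                    // identity by default
    for (size_t i = 0; from[i] && to[i]; i++)
        map[(unsigned char)from[i]] = (unsigned char)to[i];
}

int main(void) {
    unsigned char map[256];
    char s[] = "abcdef";
    build_map(map, "abc", "ghi");
    for (size_t i = 0; s[i]; i++)
        s[i] = (char)map[(unsigned char)s[i]];
    printf("%s\n", s);                                // prints "ghidef"
    return 0;
}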
There isn't as easy a win for Unicode, so I played with some simple hash tables. Although the conventional wisdom is to use the high-order bits of a hash function's output, FNV-1a is not so good for short inputs of one or two bytes. (I used the 32-bit variant.) It was better just to use the low-order bits.
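For reference, the 32-bit FNV-1a in question, and indexing with the low-order bits (assuming a power-of-two table size):

#include <stdint.h>
#include <stddef.h>

// 32-bit FNV-1a.
static uint32_t fnv1a32(const void *data, size_t len) {
    const unsigned char *p = data;
    uint32_t h = 2166136261u;          // FNV offset basis
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                // FNV prime
    }
    return h;
}

// Index with the low-order bits, which worked better here for 1-2 byte
// keys than taking the high bits.
static size_t bucket_index(uint32_t h, size_t table_size_pow2) {
    return h & (table_size_pow2 - 1);
}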
I enjoy reading posts about C just to see how much boilerplate and how many footguns higher-level languages let you avoid. I would like to write C, but I just have no confidence I would remember, or even know, how to do things right (use `calloc` vs. `malloc`, free the table when allocation of `entries` fails, etc.).
Counterpoint: when you're writing C, you end up implementing only as much complexity as you actually need. I rolled a hash table for a project that only needed integer keys and (I think) string values. It didn't need to carry around all the complexity of hashing arbitrary structured data for keys, etc. Multi-threading was supported, but a simple lock over the whole table was sufficient for correctness and met the required performance.
The whole thing was perhaps 200 lines of straightforward code. If you don't need to carry around support for all the third-prime-numbered-Tuesday-of-the-month scenarios that general purpose libraries have to, you end up with code you can understand fully and don't have to spend time reasoning about complexity that doesn't exist.
The thing about boilerplate is that you were always rewriting it, so avoiding the footguns became habit.
(So ... which takes longer, writing the data structure you want, as we all did then, or searching for the perfect library, as some people do now?)
EDIT: awk(1) had built-in hash tables, and dates from 1977. Note that generic hashing becomes much more useful after D-space gets larger than 16-64k! (14-16 bits of address)
I find that stuff enjoyable. I almost never use malloc, I learnt to use calloc so it's my go-to allocator.
Whenever I write an allocation I try to write a free after it, then fill in the code in between.
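Roughly that habit in practice; a made-up create function where the error-path free was written before the code in between:

#include <stdlib.h>

struct table {
    struct { const char *key; void *value; } *entries;
    size_t capacity;
};

struct table *table_create(size_t capacity) {
    struct table *t = malloc(sizeof *t);
    if (t == NULL) return NULL;
    t->entries = calloc(capacity, sizeof *t->entries);  // zeroed slots
    if (t->entries == NULL) {
        free(t);          // the free written first: don't leak the table itself
        return NULL;
    }
    t->capacity = capacity;
    return t;
}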
Use of static analysers and sanitisers helps too.
Like some people enjoy making all their hand tools from scratch in their workshop, there's a craft and discipline to follow and a satisfaction in doing things well; I enjoy the craft and discipline of C for the same reason.
Enjoyable article and thorough run through. I didn't read all of it, but it really took me back to learning 'C' in comp sci in the early 90s. Great times.
> but it is non-ideal that I’m only allowing half the range of size_t.
I am fairly certain that in C it's actually impossible to have an object whose size is larger than half the range of size_t, unless ptrdiff_t is wider than size_t, which it normally isn't. Unless, of course, the C standard decided to make subtracting two valid pointers into the same array (or one past the end) potential UB, just because.
> If the result is not representable in an object of that type, the behavior is undefined. In other words, if the expressions P and Q point to, respectively, the i-th and j-th elements of an array object, the expression (P)-(Q) has the value i - j provided the value fits in an object of type ptrdiff_t.
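A contrived illustration of the quoted rule (assuming the huge allocation even succeeds, which it normally won't):

#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

void demo(void) {
    char *buf = malloc((size_t)PTRDIFF_MAX + 2);
    if (buf == NULL) return;                  // the usual outcome
    char *end = buf + (size_t)PTRDIFF_MAX + 2;  // one past the end: valid pointer
    ptrdiff_t d = end - buf;                  // result doesn't fit in ptrdiff_t: UB
    (void)d;
    free(buf);
}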
Although, I can't help thinking you're going to run into bigger problems than that if the size of your buffer has grown to 9,223,372,036,854,775,807 bytes.
Well, but what if you're working with a 16-bit microcontroller? Apparently there were quite a few heated discussions about that, which is why, while size_t was only required to be at least 16 bits, ptrdiff_t was required to be at least 17 bits in C99... and then it was reverted back to 16 bits in C23. See e.g. [0]
This trade-off is actually making me sad. One can either have all the sizes and offsets fit into (unsigned/signed) integers, or one can have objects spanning more than half of the memory space. I think for my language I'd pick "all sizes/offsets are signed, and so yeah, you can't span all of the memory with a single char array".
... although I can't help thinking you have larger problems if you have two objects larger than 16k in the same array in a machine that has at most 64k of memory. :-P Or a char array greater than 32k for that matter.
I do appreciate the theoretical horror, but thank you anyway, C23! Having to cast your size_t's to ptrdiff_t's before you subtract them in order to preserve portability seems ... impractical.
You could always subtract two size_t's: they're unsigned, after all, such behaviour has always been defined, and the result is also unsigned. But if you want to subtract pointers, apparently you'd better convert them to uintptr_t first.
lol. You can't always subtract two size_t's. (size_t)3 - (size_t)4. And (ptrdiff_t)((size_t)3-(size_t)4) actually would be undefined behavior in C++20!
The "(size_t)3 - (size_t)4" is defined to be the same as "-(size_t)1", which is defined, and the same as "(size_t)1 + ~(size_t)1". Also, "(ptrdiff_t)((size_t)3-(size_t)4)" is defined both before and after C++20:
> If the destination type is signed, the value does not change if the source integer can be represented in the destination type. Otherwise the result is [implementation-defined (until C++20)] [the unique value of the destination type equal to the source value modulo 2^n where n is the number of bits used to represent the destination type (since C++20)] (note that this is different from signed integer arithmetic overflow, which is undefined).
However, the value of (ptrdiff_t)((size_t)3-(size_t)4) is positive if ptrdiff_t has at least 17 bits, and size_t only has 16 bits, and negative whenever ptrdiff_t and size_t have the same number of bits.
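Concretely, assuming the common case where size_t and ptrdiff_t have the same width:

#include <stdio.h>
#include <stddef.h>

int main(void) {
    size_t a = 3, b = 4;
    size_t d = a - b;               // wraps: well defined, equals SIZE_MAX
    ptrdiff_t s = (ptrdiff_t)d;     // -1 on typical implementations (conversion is implementation-defined in C)
    printf("%zu %td\n", d, s);
    return 0;
}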
Open-addressing hash tables are typically the way to go in C. They drop the added complexity of managing a singly linked list per bucket by turning gets and sets into what is essentially an array operation.
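A toy sketch of what that looks like (fixed capacity, no resizing): both the getter and the setter are just a probe over a flat array.

#include <string.h>
#include <stddef.h>

#define CAP 128   // power of two; a real table would resize before filling up

struct slot { const char *key; int value; };
static struct slot slots[CAP];

static size_t hash_str(const char *s) {
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;   // djb2
    return h;
}

void set(const char *key, int value) {
    size_t i = hash_str(key) & (CAP - 1);
    while (slots[i].key && strcmp(slots[i].key, key) != 0)
        i = (i + 1) & (CAP - 1);   // linear probing
    slots[i].key = key;            // toy: keeps the caller's pointer, no copy
    slots[i].value = value;
}

int *get(const char *key) {
    size_t i = hash_str(key) & (CAP - 1);
    while (slots[i].key) {
        if (strcmp(slots[i].key, key) == 0) return &slots[i].value;
        i = (i + 1) & (CAP - 1);
    }
    return NULL;
}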
Casting between void pointers and function pointers is, strictly speaking, undefined behavior. Though it does tend to work, it's not something to rely on.
(And you don't need casts when converting between void pointers and data pointers)
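One common workaround, sketched below with made-up names: round-trip through another function pointer type instead of void *, since converting between function pointer types is defined as long as the call goes through the correct type.

#include <stdio.h>
#include <stddef.h>

typedef void (*any_fn)(void);

struct callback {
    any_fn fn;     // generic function "handle"
    void *ctx;     // data pointers convert to/from void * with no cast needed
};

static void greet(int n) { printf("hello %d\n", n); }

static void run(struct callback cb) {
    void (*f)(int) = (void (*)(int))cb.fn;   // convert back to the real type before calling
    f(42);
}

int main(void) {
    struct callback cb = { (any_fn)greet, NULL };
    run(cb);
    return 0;
}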
Remember that this is for C, not C++. Void pointers are entirely unnecessary in C++ unless you're interoperating with existing libraries and can't change the interface. You'll get faster and safer code with templates and function overloading.
From my experience, learning a language like C goes beyond what you cover in University. Either that or the one I went to was **.
I remember back in the late 90s to early 00s, I had fun countering OOP guys claiming C "cannot do XYZ" - but it obviously can.. the C way.
For method override, my mind was blown when I discovered function pointers!
It was my "Wait. What?" moment.. along with various other things!
Below is sample code (a quick sketch):

#include <stdio.h>

struct weapon_t;   // forward declaration, not defined here

// generic monster "class"
struct monster_t {
    int health;
    struct weapon_t *weapon;
    // etc..
    void (*attack)(struct monster_t *self); // our function pointer
    // more function pointers for update, die, hit, etc...
};

// impl
void monster_notassigned_attack(struct monster_t *monster) { // safety-net default
    printf("Attack not assigned to monster...\n");
}

void monster_dragon_attack(struct monster_t *monster) {
    // attack for dragon
    printf("Dragon is attacking you...\n");
}

void monster_wizard_attack(struct monster_t *monster) {
    // attack for wizard
    printf("Wizard is attacking you...\n");
}

// creation (caller provides storage; just initialize and assign)
void monster_dragon_create(struct monster_t *monster) {
    monster->health = 100;
    monster->weapon = NULL;
    monster->attack = monster_dragon_attack;
}

void monster_wizard_create(struct monster_t *monster) {
    monster->health = 80;
    monster->weapon = NULL;
    monster->attack = monster_wizard_attack;
}

int main(void) {
    // testing...
    struct monster_t monsters[10]; // can store various monsters!
    monster_dragon_create(&monsters[0]);
    monster_wizard_create(&monsters[1]);
    for (int i = 0; i < 2; ++i) {
        monsters[i].attack(&monsters[i]); // yeah, you have to pass it in...
    }
    return 0;
}
Not suggesting I support or follow the above, especially for games. It is just a proof of concept!
I always tell people that coding in C (or any non-OOP language) should not be done the way you would in Java, C#, etc. As mentioned, the above is a demonstration.
This is a bit of a pet peeve with anonymous Internet writing, but: are we supposed to know when you were at university? Do you know when I was? Why not state something like the number of decades or whatever?
And yeah, a void * is how you express pointer to data of unknown type in C.
> This is a bit of a pet peeve with anonymous Internet writing, but: are we supposed to know when you were at university? Do you know when I was? Why not state something like the number of decades or whatever?
Does it matter? I tend to read it as "haven't touched it/had any use for it since university".