Unless I really didn't want to introduce dependencies, or needed to keep code size down, I think I'd use an off-the-shelf hash table implementation these days. It's still a fun exercise to build your own, though.
One important consideration is missing from this benchmark: how they behave with multi-threading.
There can be big differences between different hash table implementations when you have to use them from multiple threads (built-in smart thread safety vs. external dumb locks, etc.).
That has nothing to do with this. Granular lock-free tables will outperform global locking by orders of magnitude, but they have to be designed differently from the ground up.
I really like the API design of libjudy for general hash table like interfaces. It prevents you from having to hash a key twice just to do "check if key is not present, then set value, otherwise leave original value in hash." The pattern also meshes better with the way iterators naturally work.
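For illustration, here's a toy fixed-size table showing that pointer-to-value-slot shape (my names, not libjudy's actual API, though JudyLIns works in this spirit). One lookup does both the "is it there?" check and the insert:

#include <stdio.h>
#include <string.h>

#define N 64   // toy: fixed size, never fills up in this example
struct slot { const char *key; void *val; };
static struct slot table[N];

// Return a pointer to the value slot for key, claiming an empty slot if
// the key isn't present yet. The caller checks *p == NULL for "was absent".
void **table_slot(const char *key) {
    size_t h = 0;
    for (const char *p = key; *p; p++) h = h * 31 + (unsigned char)*p;
    for (size_t i = h % N; ; i = (i + 1) % N) {
        if (table[i].key == NULL) { table[i].key = key; return &table[i].val; }
        if (strcmp(table[i].key, key) == 0) return &table[i].val;
    }
}

int main(void) {
    void **v = table_slot("answer");
    if (*v == NULL) *v = "forty-two";   // set only if not already present
    printf("%s\n", (char *)*v);
    return 0;
}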
Also, in terms of this table, if you add the computed key hash value to the stored hash entry, you can check that the value of the hashes match before you do the full strcmp. If you have a weak hash you might also get a benefit from checking that the first characters match before calling the full strcmp.
It would also make rehashing easier, since you'd already have the hash on hand and wouldn't need to push every entry back through your internal set function to move it into the new table. In the posted implementation, a rehash pays the worst-case cost because every key gets hashed again from scratch.
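A rough sketch of both ideas (the struct and function names are mine, not from the article): keep the computed hash in each entry, compare hashes before strcmp, and place entries directly by the stored hash during a resize:

#include <string.h>
#include <stdint.h>
#include <stddef.h>

struct entry {
    const char *key;
    void *value;
    uint64_t hash;     // full hash kept alongside the entry
};

// Lookup: cheap integer compare first, strcmp only on a hash match.
static int entry_matches(const struct entry *e, const char *key, uint64_t h) {
    return e->hash == h && strcmp(e->key, key) == 0;
}

// Resize: place each old entry directly by its stored hash, with no
// re-hashing and no call back into the public set function.
// new_entries is assumed calloc'd (all keys NULL), new_cap a power of two.
static void move_entry(struct entry *new_entries, size_t new_cap,
                       const struct entry *old) {
    size_t i = (size_t)(old->hash & (new_cap - 1));
    while (new_entries[i].key != NULL)
        i = (i + 1) & (new_cap - 1);   // linear probing
    new_entries[i] = *old;
}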
Anyways.. "man hsearch.3" if you want something cheap and easy.
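For reference, roughly what that looks like (POSIX <search.h>, error handling omitted):

#include <search.h>
#include <stdio.h>

int main(void) {
    hcreate(64);                          // capacity is fixed up front
    ENTRY e = { .key = "apple", .data = "red" };
    hsearch(e, ENTER);                    // insert
    ENTRY *found = hsearch((ENTRY){ .key = "apple" }, FIND);
    if (found)
        printf("%s -> %s\n", found->key, (char *)found->data);
    hdestroy();
    return 0;
}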
POSIX hsearch tables are terrible. There's a reason why nobody uses them.
One at a time per program, having to know how big the table needs to be at initialization time with no automatic resizing, an API that's very limited (no way to remove entries, iterate over entries, tell how many entries are in the table, ...)
Reentrant versions have existed for quite some time.
> having to know how big the table needs to be at initialization time with no automatic resizing
The posted implementation resizes by creating an entirely new table and then moving all entries. The exact same mechanism is available with hsearch.
> no way to remove entries, iterate over entries
Strong downside, but the posted implementation has no removal function either. It also leaks key memory on resize and free, and its iterator would need modification to allow deletion during iteration.
>Reentrant versions have existed for quite some time.
Not in POSIX. glibc has one, but that doesn't do much good if you want your code to work portably. I do see they've made their way into FreeBSD and NetBSD, but not OpenBSD. Are they available on macOS? Any of the surviving commercial Unixes?
>> having to know how big the table needs to be at initialization time with no automatic resizing
>The posted implementation resizes by creating an entirely new table and then moving all entries. The exact same mechanism is available with hsearch.
That's how resizing a hash table typically works, yes. Except it's normally done automatically and transparently by the hash table code. hsearch doesn't do that. And since it doesn't give you a way to iterate over values in the table, you'd have to keep a separate list of everything you add to it in order to manually delete the existing table and make a bigger one and re-add everything to it... Like I said, it's terrible.
I was noodling in this area recently, trying to speed up some code similar to the tr utility:
$ echo abcdef |tr abc ghi
ghidef
For an eight-bit character set, I found that building an array to map every character improved on linear search, even for short replacement strings and relatively short input strings.
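Something along these lines (illustrative names):

#include <stdio.h>
#include <stddef.h>

// Build a 256-entry map once, then translation is one lookup per byte.
static void build_map(unsigned char map[256], const char *from, const char *to) {
    for (int i = 0; i < 256; i++)
        map[i] = (unsigned char)i;                    // identity by default
    for (size_t i = 0; from[i] && to[i]; i++)
        map[(unsigned char)from[i]] = (unsigned char)to[i];
}

int main(void) {
    unsigned char map[256];
    char s[] = "abcdef";
    build_map(map, "abc", "ghi");
    for (size_t i = 0; s[i]; i++)
        s[i] = (char)map[(unsigned char)s[i]];
    printf("%s\n", s);                                // prints "ghidef"
    return 0;
}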
There isn't as easy a win for Unicode, so I played with some simple hash tables. Although the conventional wisdom is to use the high-order bits of a hash function's output, FNV-1a is not so good for short inputs of one or two bytes. (I used the 32-bit variant.) It was better just to use the low-order bits.
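For reference, the 32-bit FNV-1a in question, and indexing with the low-order bits (assuming a power-of-two table size):

#include <stdint.h>
#include <stddef.h>

// 32-bit FNV-1a.
static uint32_t fnv1a32(const void *data, size_t len) {
    const unsigned char *p = data;
    uint32_t h = 2166136261u;          // FNV offset basis
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                // FNV prime
    }
    return h;
}

// Index with the low-order bits, which worked better here for 1-2 byte
// keys than taking the high bits.
static size_t bucket_index(uint32_t h, size_t table_size_pow2) {
    return h & (table_size_pow2 - 1);
}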
I enjoy reading posts about C just to see how much boilerplate and how many footguns higher-level languages let you avoid. I would like to write C, but I just have no confidence I would remember, or even know, how to do things right (use `calloc` vs. `malloc`, free the table when allocation of `entries` fails, etc.).
Counterpoint: when you're writing C, you end up implementing only as much complexity as you actually need. I rolled a hash table for a project that only needed integer keys and (I think) string values. It didn't need to carry around all the complexity of hashing arbitrary structured data for keys, etc. Multi-threading was supported, but a simple lock over the whole table was sufficient for correctness and met the required performance.
The whole thing was perhaps 200 lines of straightforward code. If you don't need to carry around support for all the third-prime-numbered-Tuesday-of-the-month scenarios that general purpose libraries have to, you end up with code you can understand fully and don't have to spend time reasoning about complexity that doesn't exist.
The thing about boilerplate is that you were always rewriting it, so avoiding the footguns became habit.
(So ... which takes longer, writing the data structure you want, as we all did then, or searching for the perfect library, as some people do now?)
EDIT: awk(1) had built-in hash tables, and dates from 1977. Note that generic hashing becomes much more useful after D-space gets larger than 16-64k! (14-16 bits of address)
I find that stuff enjoyable. I almost never use malloc, I learnt to use calloc so it's my go-to allocator.
Whenever I write an allocation I try to write a free after it, then fill in the code in between.
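Roughly that habit in practice; a made-up create function where the error-path free was written before the code in between:

#include <stdlib.h>

struct table {
    struct { const char *key; void *value; } *entries;
    size_t capacity;
};

struct table *table_create(size_t capacity) {
    struct table *t = malloc(sizeof *t);
    if (t == NULL) return NULL;
    t->entries = calloc(capacity, sizeof *t->entries);  // zeroed slots
    if (t->entries == NULL) {
        free(t);          // the free written first: don't leak the table itself
        return NULL;
    }
    t->capacity = capacity;
    return t;
}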
Use of static analysers and sanitisers helps too.
Like some people enjoy making all their hand tools from scratch in their workshop, there's a craft and discipline to follow and a satisfaction in doing things well; I enjoy the craft and discipline of C for the same reason.
Enjoyable article and thorough run through. I didn't read all of it, but it really took me back to learning 'C' in comp sci in the early 90s. Great times.
> but it is non-ideal that I’m only allowing half the range of size_t.
I am fairly certain that in C it's actually impossible to have an object whose size is larger than half the range of size_t, unless ptrdiff_t is wider than size_t, which it normally isn't. Unless, of course, the C standard decided to make subtracting two valid pointers into the same array (or one past the end) potential UB, just because.
> If the result is not representable in an object of that type, the behavior is undefined. In other words, if the expressions P and Q point to, respectively, the i-th and j-th elements of an array object, the expression (P)-(Q) has the value i - j provided the value fits in an object of type ptrdiff_t.
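A contrived illustration of the quoted rule (assuming the huge allocation even succeeds, which it normally won't):

#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

void demo(void) {
    char *buf = malloc((size_t)PTRDIFF_MAX + 2);
    if (buf == NULL) return;                  // the usual outcome
    char *end = buf + (size_t)PTRDIFF_MAX + 2;  // one past the end: valid pointer
    ptrdiff_t d = end - buf;                  // result doesn't fit in ptrdiff_t: UB
    (void)d;
    free(buf);
}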
Although, I can't help thinking you're going to run into bigger problems than that if the size of your buffer has grown to 9,223,372,036,854,775,807 bytes.
Well, but what if you're working with a 16-bit microcontroller? Apparently there were quite a few heated discussions about that, which is why, while size_t was only required to be at least 16 bits, ptrdiff_t was required to be at least 17 bits in C99... and then it was reverted back to 16 bits in C23. See e.g. [0]
This trade-off is actually making me sad. One can either have all the sizes and offsets fit into (unsigned/signed) integers, or one can have objects spanning more than half of the memory space. I think for my language I'd pick "all sizes/offsets are signed, and so yeah, you can't span all of the memory with a single char array".
... although I can't help thinking you have larger problems if you have two objects larger than 16k in the same array in a machine that has at most 64k of memory. :-P Or a char array greater than 32k for that matter.
I do appreciate the theoretical horror, but thank you anyway, C23! Having to cast your size_t's to ptrdiff_t's before you subtract them in order to preserve portability seems ... impractical.
You could always subtract two size_t's: they're unsigned, after all, such behaviour has always been defined, and the result is also unsigned. But if you want to subtract pointers, apparently you'd better convert them to uintptr_t first.
lol. You can't always subtract two size_t's. (size_t)3 - (size_t)4. And (ptrdiff_t)((size_t)3-(size_t)4) actually would be undefined behavior in C++20!
The "(size_t)3 - (size_t)4" is defined to be the same as "-(size_t)1", which is defined, and the same as "(size_t)1 + ~(size_t)1". Also, "(ptrdiff_t)((size_t)3-(size_t)4)" is defined both before and after C++20:
> If the destination type is signed, the value does not change if the source integer can be represented in the destination type. Otherwise the result is [implementation-defined (until C++20)] [the unique value of the destination type equal to the source value modulo 2^n where n is the number of bits used to represent the destination type (since C++20)] (note that this is different from signed integer arithmetic overflow, which is undefined).
However, the value of (ptrdiff_t)((size_t)3-(size_t)4) is positive if ptrdiff_t has at least 17 bits, and size_t only has 16 bits, and negative whenever ptrdiff_t and size_t have the same number of bits.
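Concretely, assuming the common case where size_t and ptrdiff_t have the same width:

#include <stdio.h>
#include <stddef.h>

int main(void) {
    size_t a = 3, b = 4;
    size_t d = a - b;               // wraps: well defined, equals SIZE_MAX
    ptrdiff_t s = (ptrdiff_t)d;     // -1 on typical implementations (conversion is implementation-defined in C)
    printf("%zu %td\n", d, s);
    return 0;
}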
Open-addressing hash tables are typically the way to go in C. They drop the added complexity of managing a singly linked list per bucket by turning gets and sets into what is essentially an array operation.
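A toy sketch of what that looks like (fixed capacity, no resizing): both the getter and the setter are just a probe over a flat array.

#include <string.h>
#include <stddef.h>

#define CAP 128   // power of two; a real table would resize before filling up

struct slot { const char *key; int value; };
static struct slot slots[CAP];

static size_t hash_str(const char *s) {
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;   // djb2
    return h;
}

void set(const char *key, int value) {
    size_t i = hash_str(key) & (CAP - 1);
    while (slots[i].key && strcmp(slots[i].key, key) != 0)
        i = (i + 1) & (CAP - 1);   // linear probing
    slots[i].key = key;            // toy: keeps the caller's pointer, no copy
    slots[i].value = value;
}

int *get(const char *key) {
    size_t i = hash_str(key) & (CAP - 1);
    while (slots[i].key) {
        if (strcmp(slots[i].key, key) == 0) return &slots[i].value;
        i = (i + 1) & (CAP - 1);
    }
    return NULL;
}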
Casting between void pointers and function pointers is, strictly speaking, undefined behavior. Though it does tend to work, it's not something to rely on.
(And you don't need casts when converting between void pointers and data pointers)
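One common workaround, sketched below with made-up names: round-trip through another function pointer type instead of void *, since converting between function pointer types is defined as long as the call goes through the correct type.

#include <stdio.h>
#include <stddef.h>

typedef void (*any_fn)(void);

struct callback {
    any_fn fn;     // generic function "handle"
    void *ctx;     // data pointers convert to/from void * with no cast needed
};

static void greet(int n) { printf("hello %d\n", n); }

static void run(struct callback cb) {
    void (*f)(int) = (void (*)(int))cb.fn;   // convert back to the real type before calling
    f(42);
}

int main(void) {
    struct callback cb = { (any_fn)greet, NULL };
    run(cb);
    return 0;
}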
Remember that this is for C, not C++. Void pointers are entirely unnecessary in C++ unless you're interoperating with existing libraries and can't change the interface. You'll get faster and safer code with templates and function overloading.
From my experience, learning a language like C goes beyond what you cover in University. Either that or the one I went to was **.
I remember back in the late 90s to early 00s, I had fun countering OOP guys claiming C "cannot do XYZ" - but it obviously can.. the C way.
For method override, my mind was blown when I discovered function pointers!
It was my "Wait. What?" moment.. along with various other things!
Below is sample code (a quick sketch):

#include <stdio.h>

struct weapon_t;   // forward declaration, not defined here

// generic monster "class"
struct monster_t {
    int health;
    struct weapon_t *weapon;
    // etc..
    void (*attack)(struct monster_t *self); // our function pointer
    // more function pointers for update, die, hit, etc...
};

// impl
void monster_notassigned_attack(struct monster_t *monster) { // safety-net default
    printf("Attack not assigned to monster...\n");
}

void monster_dragon_attack(struct monster_t *monster) {
    // attack for dragon
    printf("Dragon is attacking you...\n");
}

void monster_wizard_attack(struct monster_t *monster) {
    // attack for wizard
    printf("Wizard is attacking you...\n");
}

// creation (caller provides storage; just initialize and assign)
void monster_dragon_create(struct monster_t *monster) {
    monster->health = 100;
    monster->weapon = NULL;
    monster->attack = monster_dragon_attack;
}

void monster_wizard_create(struct monster_t *monster) {
    monster->health = 80;
    monster->weapon = NULL;
    monster->attack = monster_wizard_attack;
}

int main(void) {
    // testing...
    struct monster_t monsters[10]; // can store various monsters!
    monster_dragon_create(&monsters[0]);
    monster_wizard_create(&monsters[1]);
    for (int i = 0; i < 2; ++i) {
        monsters[i].attack(&monsters[i]); // yeah, you have to pass it in...
    }
    return 0;
}
Not suggesting I support or follow the above, especially for games. It is just a proof of concept!
I always tell people that coding in C (or any non-OOP language) should not be done the way you would in Java, C#, etc. As mentioned, the above is a demonstration.
This is a bit of a pet peeve with anonymous Internet writing, but: are we supposed to know when you were at university? Do you know when I was? Why not state something like the number of decades or whatever?
And yeah, a void * is how you express pointer to data of unknown type in C.
> This is a bit of a pet peeve with anonymous Internet writing, but: are we supposed to know when you were at university? Do you know when I was? Why not state something like the number of decades or whatever?
Does it matter? I tend to read it as "haven't touched it/had any use for it since university".