Myths about Hash Tables

Udo · on Sept 30, 2016

While the article is a bit polemic, I really wish more people knew how hash tables work under the hood.

When I was looking for a job, I flunked out of one interview where I got into an argument with an interviewer about the runtime properties of hash tables - more specifically: the interviewer was adamant their access time was always O(1), wanted me to admit my "mistake" and move on, but I just couldn't ;) It was one of those rare cases where they give you feedback on why you're not being hired, too. The lady on the phone said my "technical knowledge" was not up to par with what they needed for the position. Yeah, I'm still salty about that.

grosbisou · on Sept 30, 2016

These are my favorite interviews. I can't just let the interviewers go when I sense some bullshit... I mean if you want to drill people on something it should be expected you know everything about it. No surprise I fail most interviews!

Like the CTO of a famous startup who didn't understand that the order of a multicolumn index has some importance. I died a few minutes later when he told me I should add an index on a boolean (whereas the queries didn't use it) because it would speed up the queries by 2. Best day of my life.

arethuza · on Sept 30, 2016

Sounds to me more like the interviewer/company flunked you're test rather than you flunking - you might have dodged a bullet there.

Udo · on Sept 30, 2016

I wouldn't have worked for the interviewer, it's often just a gatekeeper you need to overcome. Also, when you don't know how to make rent next month and the company seems pretty cool (apart from that person), it's hard to see at the time how you dodged any bullets...

arethuza · on Sept 30, 2016

If someone reacts badly when you know more than they do then that's not a great bit of behaviour, if a company chooses such a person as an interviewer then that's a bit of a red flag to me - an interview should be as much about trying to present a positive image of the company as much as testing the interviewer.

Sorry to hear about the rent situation - I take that worked out OK eventually?

Udo · on Sept 30, 2016

> if a company chooses such a person as an interviewer then that's a bit of a red flag to me

I think the basic corporate rules of signalling competence apply here. Assuming that person was both technologically and socially incompetent (which might be too harsh of a judgement), some people still have a way of getting promoted into valuable positions by appearing to do a great job.

> Sorry to hear about the rent situation - I take that worked out OK eventually?

Thankfully that was some years back, I'm doing okay today :)

arethuza · on Sept 30, 2016

"which might be too harsh of a judgement"

Not sure it is - the usual advice is "if in doubt then don't hire" so why not apply that to employers as well as employees?

Of course, if there is an element of desperation involved then I can understand it. But I have walked away from interview processes (or turned down offers) when I had some worries and never really regretted it.

Harshness both ways would seem to be fair.

marklgr · on Sept 30, 2016

Something could be said for being able to convincingly explain why you're right to your coworkers/manager, without hurting their feelings (even though they might be stubborn about it at first). That's a precious soft skill.

quickben · on Sept 30, 2016

The rule is that you ignore the constant. O(5) is still O(1).

It is not 1 but that's the agreement.

In your specific case maybe you both weren't aware of the nomenclature? In any case, they did you a favor. If you are arguing about hash tables from the start, who knows how it will be when you ask for vacation or sick days.

Udo · on Sept 30, 2016

> In your specific case maybe you both weren't aware of the nomenclature?

Well, "always O(1)" is simply wrong in my opinion. This may be a cultural thing, but I'm used to referring to the worst case behavior in big O as a default, so I would have said something along the lines of "the worst case is O(n) and that can turn out to be problematic in some cases". And normally interviewers do want you to talk about these kind of corner cases. I don't think I'm using the nomenclature wrong, and I certainly didn't ever refer to something as O(5) which as far as I can tell is not and should never a thing...

To my best recollection, the question came up as a dynamic language issue. Specifically what the interviewer wanted to hear is that accessing (both read and write) an object property in JavaScript is always O(1), since it's a hash map access under the hood. Leaving aside for a moment that this is fuzzily untrue in a typically JITed environment, it's also untrue in general. You could try to make a case for that when doing a read on a specially optimized hash table, but it would certainly not be true in the general case where you also add things to the table. But all of these considerations are already way too detailed and far beyond what the interviewer wanted to talk about.

They just wanted to check off "knows it's a hash table under the hood" and "knows that hash tables are always O(1)", even though both of these are untrue in my opinion.

lorantt · on Sept 30, 2016

O(1) is usually prefixed with "amortized" or "average worst-case", which is where the difference in interpretation comes from imo. I found this explanation a bit wishy washy on some details but good (and has an appropriate appeal to authority/original sources).

https://m.reddit.com/r/compsci/comments/2z74z8/why_is_hashta...

sp332 · on Sept 30, 2016

Yup, pathological worst-case can be annotated with an omega Ω(n).

quickben · on Oct 1, 2016

There are three curves. Big O is simply the average best fit. You also have max worst and min best.

JavaScript does (unless I'm wrong) a layer of variable lookup, so the time is far off from O(1).

So, you are right on technical terms, and on interview terms, it depends. When I was just out of university I'd have probably went for the job. As I'm getting more senior, I'd still be polite and explain them where they are wrong, but stand my ground with the recruiter afterwards and tell him the interviewer was out of his depth, and checking off questions he probably googled somewhere. I'd then ask him to give me more serious companies to be considered by. Stuff like your story have more to do with team skills then technical skills. Who cares if you answer right or wrong on the question, when you are the guy you are supposed to work with get pissed off with each other after few minutes technical discussion.

General advice: go for where you are celebrated and not for where you are tolerated. A bit of synergy can go a long way.

zzzcpan · on Sept 30, 2016

Isn't it too implementation specific to simply claim worst-case as O(1) or O(n)? Depending on how much memory is acceptable to waste and how fast insertions should be worst-case could range from O(1) to O(n).

Udo · on Sept 30, 2016

I would not consider a table where the data size of the hashed keys equals the data size of the unhashed keys a hash map. If you got, say, a u8 as a key, and your lookup table uses u8s for keys internally, that's not what most people would refer to as a hash table. It's simply a lookup array.

In all the usual cases, the hashed key size is way less than the unhashed key size, and you need to pay the price for collisions somewhere.

zzzcpan · on Sept 30, 2016

No, if you rebuild on collision you waste just a bit more memory, but get O(1) worst case look-ups. It's still a hash table and not a table that maps an entire key space. See how dynamic perfect hashing works.

Udo · on Sept 30, 2016

Read what I wrote. Rebuilding is not free.

zzzcpan · on Sept 30, 2016

There are various trade-offs, as I said, but getting O(1) worst-case look-ups is not hard. Only naive implementations could corner you into O(n) worst-case.

I'm sorry, but you don't really understand hash tables.

Udo · on Sept 30, 2016

Read things more carefully before rushing to judgement. I'm not talking about a lookups-only scenario. Of course you can make anything O(1) as long as you arbitrarily define all the parts that aren't as external to the solution, but you're just kidding yourself in the end.

> I'm sorry, but you don't really understand hash tables.

I'm not the one who made this personal, but I'll match that allegation back to you, and I raise you one "you don't really know how to interpret what people say".

zzzcpan · on Sept 30, 2016

I'm sorry, I didn't mean to make it personal.

Udo · on Sept 30, 2016

Honestly, I think implying that rebuilding the hash table is an O(1) operation is worse. :P But judging by the downvotes I'm getting on that particular comment, it seems that most people agree with you that rebuilding is indeed free. So it seems you won this argument on several fronts ;)

treebeard901 · on Sept 30, 2016

Plot twist: zzzcpan was the interviewer.

Udo · on Sept 30, 2016

I'd call that a Seinfeldian twist ;)

sp332 · on Sept 30, 2016

It's only O(1) until it gets full. Either you have hash buckets which take linear time to search through, or you do a linear probe like in the article. And if it outgrows hardware cashes, performance gets worse as O(log(n)), but that's not a feature of the algorithm itself.

hvidgaard · on Sept 30, 2016

It would be fairly simple to create an argument he cannot refute. Make a hashtable with 4 slots, put 16 elements in it, even with perfect distribution it would have to look through 4 elements to find the last one. This means it takes O(1/4n), or, since constants doesn't matter O(n). If he cannot accept that I would simply walk out and inform them that I could not take the job seriously, when the interviewer don't know such a basic thing.

hthh · on Sept 30, 2016

Dynamic resizing fixes this, although hash collisions can still be a problem (albeit a very unlikely one if your hash is secure). https://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing

mason55 · on Sept 30, 2016

The number of collisions is only loosely based on the hash function, it's much more closely tied to the fill factor.

If you keep your hash table 1% full then you're unlikely to run into a collision no matter how bad your hash function is. If you keep your table 99% full then it doesn't matter how good your hash function is. You still have to modulo your hash by the number of buckets.

wongarsu · on Sept 30, 2016

Yes, there are implemenations of hash tables that solve this problem. Hash collisions still ruin your worst case performance, but as long as nobody is asking how long you spend on inserts you can solve that too. As a trivial example, on each insert make the hashtable larger and change the hash function randomly until you found a solution that has O(1) lookup for all elements.

But now we're far outside the realm of "properties of hash tables" and instead inside "properties of my hash table", and any decent interviewer should recognize that.

lorantt · on Sept 30, 2016

I think people here seem to implicitly assume linked buckets, those are bad on modern architectures for several reasons.

Look at (Hopscotch,) Robin Hood or Cuckoo Hashing for hashing with linear probing, high fill factor (~0.9) and _amortized_ O(1). I've seen a paper somewhere that proved Robin Hood worst-case O(log log n) afair.

http://www.sebastiansylvan.com/post/robin-hood-hashing-shoul...

https://en.m.wikipedia.org/wiki/Hopscotch_hashing

You can make open probing performantly concurrent with 2 bytes of memory overhead per key/value too:

http://preshing.com/20160314/leapfrog-probing/

hvidgaard · on Oct 3, 2016

My first tought about dynamic resizing is that you can have an amortized O(1) but still is O(n) worst case. Both then exploiting the hash function to force all elements into the same bucket, but also when triggering a resize.

You can make an expanding array with O(1) operations, but I'm unsure if it's applicable to a hashtable. It may be, but it's beginning to become very complex.

seanwilson · on Sept 30, 2016

> Engineers irrationally avoid hash tables because of the worst-case O(n) search time.

Not sure about other people here but I haven't heard any developers I work with avoid hash tables for reasons like this. I find most coders treat hash tables as black boxes that offer instant lookups.

Also, I found from giving interviews most developers can't explain at a basic level how a hash table works as well as other fundamental data structures like linked lists.

usrusr · on Sept 30, 2016

Engineers avoiding hash tables, yes, that's one myth. Still waiting for the other four promised in the headline. In the age of javascript everywhere I'd say that hash tables have pretty much won.

I was curious about which corner of software technology the author was coming from where hash tables are routinely questioned (embedded C maybe?), but even the followup post does not quite explain it. ( https://hughewilliams.com/2013/06/03/reflecting-on-my-hash-t... , also references a previous hn discussion)

masklinn · on Sept 30, 2016

> In the age of javascript everywhere I'd say that hash tables have pretty much won.

All of Lua, PHP, Ruby and Python are hash tables flying around, and pretty much everything web is going to involve tons of hashtables whatever your language as HTTP headers, QS parameters and form-encoded data have arbitrary named fields.

usrusr · on Oct 1, 2016

Makes me wonder if today's Lisps are actually "Hasps" in disguise.

(unwritten rule of hn: don't post terrible puns like this before the majority of voting is over)

jhdevos · on Sept 30, 2016

After interviewing many more and less experienced programmers (and asking most of them if they can explain a hash table - just for statistics!), I concur that most cannot.

There were quite some people that thought trees were faster than hash tables though - mostly the ones that had some incling of what a hash table does, but didn't know the entire picture. At least in all those cases I had the pleasure of educating them a little bit :-)

seanwilson · on Sept 30, 2016

> After interviewing many more and less experienced programmers (and asking most of them if they can explain a hash table - just for statistics!), I concur that most cannot.

I found this realisation pretty shocking myself. Several experienced Java developers told me they hadn't even heard the phrase "linked list" before when Java even has a class called LinkedList. :(

I see threads on here all the time about how interviews are broken and you shouldn't be expected to be quizzed on data structures if you're an experienced programmer but I don't agree with that. If you claim to be an experienced programmer and can't explain roughly how a hash table or a linked list works you've obviously got big holes in your knowledge in the areas of optimisation and scalability.

Annatar · on Sept 30, 2016

I see threads on here all the time about how interviews are broken and you shouldn't be expected to be quizzed on data structures if you're an experienced programmer

Anyone who thinks that an experienced programmer shouldn't know data structures just isn't programmer material: someone who truly appreciates this profession will have had formal education in algorithms and data structures, and if they do not, they will make sure they learn it on their own. Otherwise - this is how we get bloated crapware.

I just cannot up-vote your comment enough, if I could, I would up-vote it with my both hands and feet!

I always think back to The Art of UNIX programming:

Rule of Representation: Fold knowledge into data, so program logic can be stupid and robust.

http://catb.org/~esr/writings/taoup/html/ch01s06.html#id2878...

vonmoltke · on Sept 30, 2016

> If you claim to be an experienced programmer and can't explain roughly how a hash table or a linked list works you've obviously got big holes in your knowledge in the areas of optimisation and scalability.

What is your definition of "roughly" and how do you think lack of that knowledge affects knowledge of optimization and scalability? I ask not because I necessarily disagree, but because your statement is very subjective. Also because I used to write large-scale real-time signal processing code that didn't use hash tables at all, so lack of knowledge there wouldn't have this effect on our code.

dkersten · on Sept 30, 2016

> how do you think lack of that knowledge affects knowledge of optimization and scalability?

I once wrote a log analysis python script. It took 2.5 hours to execute to completion. I then used Cython to compile it to C. It then took about 2 hours to run to completion.

I then stepped back and thought about the data structures I was using. I replaced some lists that were being searched with dictionaries that could were keyed by what I was searching my lists for. Replaced a few other simple data structures.

Now the script took, IIRC, 20 seconds to run to completion.

Now, I don't think you necessarily need to know all of the internal details of how a data structure is implemented to know its performance characteristics or when to use (or not use) them. However, the point is that by understanding their performance characteristics, code can be optimized or made more scalable.

Also, while something like a hash table might not be faster in actual real life terms (ie worse O() complexity can still be better wall clock; eg searching an array linearly can be faster than accessing a hash table since the array can be in cache and prefetched) when run on a single machine, big-O complexity can start mattering much more when you have a distributed workload or TB worth of data. (Although if you have a distributed workload, you have distributed systems problems to deal with too).

Having said all of this, without going into specific implementation details, hash tables and linked lists are incredibly simple and I see no reason why anyone calling themselves a programmer would have trouble with these. I mean, I wouldn't expect you to be able to write a good hash function or anything, but explaining the basic concept shouldn't be too hard for most programmers. If someone is self-taught, then ok maybe they haven't ever needed to learn this. Its definitely possible, but I would at least hope they would know basic performance characteristics or at least some rules of thumb as to when to use what data structure. Hopefully based on logic and not just hearsay ;-)

seanwilson · on Sept 30, 2016

> Having said all of this, without going into specific implementation details, hash tables and linked lists are incredibly simple and I see no reason why anyone calling themselves a programmer would have trouble with these. I mean, I wouldn't expect you to be able to write a good hash function or anything, but explaining the basic concept shouldn't be too hard for most programmers. If someone is self-taught, then ok maybe they haven't ever needed to learn this.

Even if you're self-taught, if you've been programming for a while you should be able to read up on what linked lists and hash tables are in 15 minutes and virtually all interview preparation guides mention data structures. If you've not had the curiosity to do that then that's a bad sign.

falcolas · on Sept 30, 2016

> If you've not had the curiosity to do that then that's a bad sign.

Or you're just working from the backs of giants using built-in types. It's easy to do, when you haven't had to dig into low level code before.

I won't disagree with your premise, but it took some pretty extraordinary circumstances for me to find value in taking the time to learn about the ins and outs of linked lists and hash tables, because my every day work never required it.

It was reading an article about cuckoo hashes which finally prompted me to go down that road, and I can't regret it. That said, I've still never had to implement my own hash table for work.

seanwilson · on Sept 30, 2016

> It was reading an article about cuckoo hashes which finally prompted me to go down that road, and I can't regret it. That said, I've still never had to implement my own hash table for work

That's what I'm getting at, people that have the curiosity to find out more. Nobody should be implementing their own hash table but if you don't know about basic data structure and algorithm concepts you're not going to be sharp at coming up with new efficient solutions.

vonmoltke · on Sept 30, 2016

> Having said all of this, without going into specific implementation details, hash tables and linked lists are incredibly simple and I see no reason why anyone calling themselves a programmer would have trouble with these. I mean, I wouldn't expect you to be able to write a good hash function or anything, but explaining the basic concept shouldn't be too hard for most programmers. If someone is self-taught, then ok maybe they haven't ever needed to learn this. Its definitely possible, but I would at least hope they would know basic performance characteristics or at least some rules of thumb as to when to use what data structure. Hopefully based on logic and not just hearsay ;-)

That's why I was trying to establish what "roughly" means. This sounds like a reasonable definition.

wongarsu · on Sept 30, 2016

>how do you think lack of that knowledge affects knowledge of optimization and scalability?

In any program that has to handle just tenthousands of items ocasionally, knowing the difference between arrays, liked lists and pointer lists, or between various forms of hash tables, various forms of trees and sorted lists makes a big difference. This kind of knowledge affects how any program scales with input size and should be one of the first things checked when optimizing.

And I'd argue that just memorizing big-O properties only gets you half the way: heapsort is theoretically superiour to quicksort (better worst case, more space efficient, same average case), yet quicksort is used far more often because it makes better use of cpu caches and has a worst case that barely ever happens.

> Also because I used to write large-scale real-time signal processing code that didn't use hash tables at all, so lack of knowledge there wouldn't have this effect on our code.

Is the fact that it doesn't use hash tables a coincidence or is it because the people designing and writing the system know their data structures and realized that hash tables don't fit the problem?

vonmoltke · on Sept 30, 2016

> In any program that has to handle just tenthousands of items ocasionally, knowing the difference between arrays, liked lists and pointer lists, or between various forms of hash tables, various forms of trees and sorted lists makes a big difference.

This is why I asked for clarification. To me "knowing the difference between" is different from knowing how they "work". Maybe it is just semantics in my head.

Edit because I forgot to address the second part:

> Is the fact that it doesn't use hash tables a coincidence or is it because the people designing and writing the system know their data structures and realized that hash tables don't fit the problem?

I suppose it is technically the latter, but it was glaringly obvious that hash tables weren't appropriate anywhere because none of our algorithms required associative lookups. Every algorithm had to touch every point of data it stored on every iteration anyway, so it was pointless to do anything but loop over it. Even if there had been a case where hash tables would have made sense, I'm not sure implementing a fixed-size hash table in C would have been worth the effort.

seanwilson · on Sept 30, 2016

> In any program that has to handle just tenthousands of items ocasionally, knowing the difference between arrays, liked lists and pointer lists, or between various forms of hash tables, various forms of trees and sorted lists makes a big difference. This kind of knowledge affects how any program scales with input size and should be one of the first things checked when optimizing.

Exactly. If a developer is picking data structures without any consideration of their growth complexity they're going to cause serious issues.

falcolas · on Sept 30, 2016

Putting on my "get off my lawn" hat here:

> they're going to cause serious issues

But not until it's gone to production, and everyone's gotten an "attaboy" for completing their project and moved on to green pastures. Those sudden problems with performance in production are now the operations team's problem, to out provision or get a maintenance team on it.

Yeah, I know it's not that way in a company with good management and technical leadership... but I've been burnt enough to have a "get off my lawn" hat for this occasion.

seanwilson · on Sept 30, 2016

> What is your definition of "roughly" and how do you think lack of that knowledge affects knowledge of optimization and scalability?

Just that you hash a value you want to store and transform that hash into an array index for fast lookup. If someone didn't know that I find it hard to believe they'd have much knowledge or interest about how to pick appropriate data structures for large collections as it's such a fundamental concept used in lots of places (e.g. indices, caching). I'm not even expecting knowledge of what happens when two keys hash to the same bucket.

> Also because I used to write large-scale real-time signal processing code that didn't use hash tables at all, so lack of knowledge there wouldn't have this effect on our code.

Is there a similar data structure question you would ask a candidate who wants to work in this domain? You don't use hashes at all?

vonmoltke · on Sept 30, 2016

> Just that you hash a value you want to store and transform that hash into an array index for fast lookup. If someone didn't know that I find it hard to believe they'd have much knowledge or interest about how to pick appropriate data structures for large collections as it's such a fundamental concept used in lots of places (e.g. indices, caching). I'm not even expecting knowledge of what happens when two keys hash to the same bucket.

Makes sense to me. I just wouldn't call that knowing how a hash table "works", which is why I asked for clarification.

> Is there a similar data structure question you would ask a candidate who wants to work in this domain?

I would focus on fixed-size data structures, since those were critical to our code. I would find out if they could go lower than Big-O (which was, quite frankly, useless to us) when thinking about runtime complexity and whether they could relate what the various operations were doing to what the processor was doing.

> You don't use hashes at all?

That code had no hashes whatsoever. We had no need to look things up in that manner.

emodendroket · on Sept 30, 2016

To be fair, LinkedList is of very limited use. I can't think of a scenario I'd use it in beyond an LRU cache.

seanwilson · on Sept 30, 2016

> To be fair, LinkedList is of very limited use. I can't think of a scenario I'd use it in beyond an LRU cache.

It's more the point that not knowing they exist is a very strong indicator you haven't read even introductory text on algorithms and data structures. I barely use them either but when I'm thinking of solutions to problems they're a useful concept to know.

emodendroket · on Sept 30, 2016

Yeah, agreed. They are extremely simple.

Annatar · on Sept 30, 2016

A smartly designed doubly-linked list can be used for extremely fast deletion or insertion, as well as handling dynamic memory allocation for arbitrarily large data sets.

Linked lists are the foundation of operating systems, especially UNIX and AmigaOS.

emodendroket · on Sept 30, 2016

Yeah, OK, that is another good example. But nevertheless, there are a lot of programs you could be asked to write where using a linked list doesn't make any sense.

falcolas · on Sept 30, 2016

If you're not familiar with intrusively linked lists, it's worth having a gander at. Learning and implementing a few of those gave me a lot more usecases for linked lists.

vonmoltke · on Sept 30, 2016

I have used a "light" linked list (a linked list that contains just three pointers) to efficiently sort arrays with very large records in C.

falcolas · on Sept 30, 2016

> Not sure about other people here but I haven't heard any developers I work with avoid hash tables for reasons like this.

I've found there's a vast difference with many engineers between their willingness to use a structure called a "Hash Table" as compared to a Python/Swift "Dictionary", JavaScript "Object", or Go "Map".

Yes, they're all hash tables underneath, but so few people actually ever make that connection consciously. Hash tables have a lot of baggage from our CS classes, and their O(n) is foremost on people's minds when thinking about them. Objects, on the other hand, are so ubiquitous and programmer friendly that you just use them without thinking that they are a hash table, and what that means.

acidbaseextract · on Sept 30, 2016

A whole post about hash tables and he doesn't mention the real reason trees are better than hash tables: lower variance!

> Hash tables become full, and bad things happen

The bad thing is not a crazy long probe or even rehashing to enlarge the table—it's that some random individual insert is stuck eating the entire cost of rebuilding the table. For many applications, I'm happy with a slightly higher cost per operation in return for predictable performance.

> Hash functions are slow

Great, another unmentioned footgun - hash functions are hard. For example, the hash function he provides doesn't use a prime number of buckets. Oops, non-uniform input data in combination with unlucky bucket counts can generate high numbers of collisions.

Writing a comparison function to put your data in a tree is pretty stupidproof!

whack · on Sept 30, 2016

There are resizing algorithms out there which specifically avoid the single-insertion-spike that you're worried about.

http://stackoverflow.com/a/2200345

If you're working with small data sizes, the difference between O(1) and O(lgN) isn't that big a deal, so you can use whatever you want. But once you start working with larger datasets containing millions of elements, and the only thing you care about is insertions and exact-match-lookups, it's hard to justify using a tree over a hash map.

mojuba · on Sept 30, 2016

For similar reasons I prefer reference counting over (tracing) garbage collection, especially in near-realtime applications, such as audio.

emodendroket · on Sept 30, 2016

I'm all for learning for hash tables work, but is avoiding them really widespread? Especially in dynamic languages I feel like dynamic arrays and hash tables are like 90% of the data structures used.

falcolas · on Sept 30, 2016

No, dynamic languages use Maps and Dictionaries and Objects and Containers and...

Sure, they're all hash tables under the hood, but if you go up to an average JavaScript programmer and ask them the difference between a Hash Table and an Object, I'm sure you'd get a surprising (to you) answer.

emodendroket · on Sept 30, 2016

I mean, not even under the hood. Those are just different names for the same thing.

masklinn · on Sept 30, 2016

Due to the way JS objects are used, modern runtime will generally try to use structures for them (e.g. "hidden classes" in V8/Chrome) rather than "general-purpose" hashmaps. Likewise JS arrays incidentally. They will often need to deoptimise back to general-purpose hashmaps, but they'll try pretty hard not to.

DiThi · on Sept 30, 2016

If an object is being used as a hash map, you can avoid hidden class optimizations by deleting whatever from it:

    var map = {};
    delete map.x;

emodendroket · on Sept 30, 2016

Oh. That's neat.

masklinn · on Sept 30, 2016

Yep. V8's JIT will also try to specialise functions based on that so if you always pass the same type (where type = hidden class), a limited number of types or "anything goes". Or at least it did a few years back.

saw-lau · on Sept 30, 2016

Interesting that the author offers the chained hash table as an alternative implementation - that was always how I was taught they worked back in school. (1986?)

douche · on Sept 30, 2016

Same here, graduated in the 2000s. The way hash tables are described in the opening section sounds really naive, and I would hope that real implementations aren't using that method.

mason55 · on Sept 30, 2016

Go find the post on the history of dict in Python from the other day to see how they move away from chaining.

emodendroket · on Sept 30, 2016

AFAIK most real-world dictionary classes use linear probing.

gpderetta · on Sept 30, 2016

std::unordered_map and friends use chaining (and they suffer from it).

emodendroket · on Sept 30, 2016

That's because that implementation is more obvious but has worse real-world performance than linear probing.

josefx · on Sept 30, 2016

The worst case was once used to attack Java based servers, using colliding post parameters I think. Since then the Java HashMap sorts colliding entries if the entries implement Comparable.

bhaak · on Sept 30, 2016

Needs a (2012) in the title.

The hash collisions you mentioned probably got fixed after or because of these CCC congress talks:

https://events.ccc.de/congress/2011/Fahrplan/events/4680.en....

https://events.ccc.de/congress/2012/Fahrplan/events/5152.en....

The article talks about peculiar myths that I haven't heard before. And the first one "the worst case is terrible" is not even a myth if you're hashing function isn't well designed and you have a malicious data provider.

masklinn · on Sept 30, 2016

> The worst case was once used to attack Java based servers

The worst case was revealed able to efficiently DOS pretty much any hashtable-based stack back in 2011[0] (practical attacks on real-world implementations, not theoretical work), following which most stacks added hash randomisation during 2012.

[0] https://events.ccc.de/congress/2011/Fahrplan/attachments/200...

danbruc · on Sept 30, 2016

But that is of course true for all algorithms without uniform performance, another famous example being sorting with quicksort which on average takes O(n x log(n)) time but can become O(n²) in the worst case. Or zip bombs making you decompress a few kilobytes of data into gigabytes. If a user can control a relevant fraction or important piece of the data you are processing, then you always have to be careful not to expose yourselves to algorithmic complexity attacks.

jkot · on Sept 30, 2016

Also String.hashCode is broken, it generates too many collisions. It can not be fixed for historic reasons.

arethuza · on Sept 30, 2016

Wasn't an early version really broken - where it only looked at the first N characters of a string when hashing it? Where N was actually pretty low.

jkot · on Sept 30, 2016

Perhaps, but it made sense with 16MB RAM and 486 CPU. Current implementation uses 31 multiplier which is useless over 1M entries. This problem is also in Arrays.hash*() and many other places.

espadrine · on Sept 30, 2016

The answer to that has always been universal hash functions.

You have a family of hash functions, one of which is picked randomly. In the family, each hash function is such that Pr(collision) ≤ Pr(randomly picking that hash function).

That way, for a sufficiently large family, it is virtually impossible to hit its worst case.

rurban · on Sept 30, 2016

No, it's not. With enough data leaking, like ordering and timings attacks are computable. Only a proper collision resolution method can prevent from DoS attacks. The blog post doesn't even know about modern hash tables, not chasing ptrs in a linked list, which is not recommended where ptrs chasing is 50x slower than adjacent entries (cpu's of the last 15 years). But for newbies it's at least a beginning.

Annatar · on Sept 30, 2016

Hash arrays are great; I use them all the time in AWK. All searches using hash indexed arrays in AWK are O(1).

I believe that the biggest problem around using hashes as indices entails people wrapping their head around the concept that the array index is a string (for all intents and purposes), rather than a number of a field in an array (or an address of a region in memory, which it eventually is anyway, once it runs). For example, I used entire lines of text as the hash for an array. My colleague who I was showing this technique to was completely dumbfounded as to how that worked. Why was I only storing the value "1" into Array["ORA-1234: blablabla"]?

  if(Array[$0] != 1)
  {
    #
    # This record is different, so print the entire record.
    #
    print;
  }

"How is that used to filter out lines which are identical the ones we have in the canonical file?" He seemed completely flabbergasted. He just couldn't wrap his head around this concept, and ended up re-implementing the entire thing using several lines of grep -v, sed, and if [ ...] then. I think there were even several temporary files in the game, whereas the hash array technique in AWK loaded both the canonical file and the input from stdin into memory, rather than using intermediate files.

krylon · on Sept 30, 2016

One interesting alternative to hash tables that appears to get too little love is the Judy tree: https://en.wikipedia.org/wiki/Judy_array

They are not quite as versatile, I think, but for simple cases where your keys are strings or ints, they work just as well and are quite fast, too. The API is very simple, too.

I don't do much programming in C these days (pretty much none at all), but I used Judy in one toy project and have fond memories of it.

matheweis · on Sept 30, 2016

Neat! I developed something similar back in college in JAVA - it was a ternary radix tree, and wow was that thing fast... something like only about 10% slower than HashMap for storing the entire English dictionary and looking up an arbitrary string a million times.

There were all sorts of other interesting advantages relative to hash tables since I kept it sorted... you could retrieve an element and then find five adjacent elements. Handy for something like a dictionary app etc.

stuxnet79 · on Sept 30, 2016

Gotta love HN. I'm always learning new stuff. Thanks for posting this. The "Judy Shop Manual" (http://judy.sourceforge.net/application/shop_interm.pdf) looks very complicated so I don't know if I should take the time to fully understand how they work, especially if they are not as versatile as hash tables.

krylon · on Sept 30, 2016

Judy's author appears to have poured bathtubs full of cleverness into his brainchild to make it fast at the cost of implementation complexity.

As long as you don't look behind the curtain, it's genius - an API so simple a child could use it, and it's so fast... But try to understand how it works, and your brain melts. Well, mine, at least.

Then again, I suspect the same can be said of what used to be called the STL, or Boost. In a library that is meant to be used by lots of other projects, the trade-off is legitimate.

greg7mdp · on Sept 30, 2016

If you are interested in hash tables, check out my improved version of Google's already excellent sparsehash at https://github.com/greg7mdp/sparsepp. Excellent performance, very low memory usage, grows as needed of course.

personjerry · on Sept 30, 2016

This seems like Data Structures 101

stefs · on Sept 30, 2016

potential offtopic question here: i always wondered about multi-layered hashtables, i.e. instead of a linked list for resolving collisions use a second hashtable with a different hash function (and then maybe a linked list on the 3rd level).

i guess this might bring some potential improvements in case of a hashtable with too many collisions but a prohibitively expensive overhead in all other cases. i.e. it might alleviate a hashtable that was already broken.

rurban · on Sept 30, 2016

This is a bad idea, since double hashing with open addressing has the same benefits but is much faster. The indices are already in the cache and there's need for an additional hash table setup overhead. Usually you go for a variant of robin-hood, cuckoo or even quadratic hashing, with the same benefits but vastly better performance. (all open addressing schemes). And if the collision rate ("load factor"/"fill rate") is too high for your use case, you go for lower load factors, down from 90% to 50%. It's still much faster than using a complete separate 2nd data structure.

And if you need something better than a hash table, (pointers, fixed size keys, ordered, ...) you usually go for patricia/crit-bit trees.

One popular database tried a RB-tree as second data structure once (forgot the name and link. starts with R but not redis). Also see http://programmers.stackexchange.com/a/281785/27031, DMDScript used that also.

hueving · on Sept 30, 2016

He mentioned his own home grown hash with the bit shifts. How does it compare to common ones that use an AES instruction if the crypto chip is available?

rurban · on Sept 30, 2016

Poor. See https://github.com/rurban/smhasher

stefs · on Sept 30, 2016

i think the article omits one interesting point - cache coherence. afaik this is the one reason otherwise inefficient data structures might result in better real world performance for small data sets.

marcosdumay · on Sept 30, 2016

Well, asymptotic analysis is not very useful for small data sets.