It’s driving me crazy.
The article (rightfully) points out that Java’s humble
String.hashCode() method — which maps arbitrary-length
String objects to 32-bit
int values — has collisions. The article also (wrongfully) makes this sound surprising, and claims that the
String.hashCode() algorithm is bad on that basis. In the author’s own words:
No matter what hashing strategy is used, collisions are enevitable (sic) however some hashes are worse than others. You can expect String to be fairly poor.
That’s pretty strongly worded!
The author has demonstrated that
String.hashCode() has collisions. However, any function that maps the infinite set of String values onto just 2^32 int values must, by the pigeonhole principle, map many distinct inputs to the same output. Collisions are unavoidable for a 32-bit String hash function, so the mere existence of collisions doesn’t make String.hashCode() bad.
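To make that concrete, here is one well-known collision pair (a standard textbook example, not one taken from the article in question). Since String.hashCode() is defined as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], any two-character strings where the first char goes up by 1 and the second goes down by 31 collide:

```java
public class HashCollision {
    public static void main(String[] args) {
        // "Aa": 31*'A' + 'a' = 31*65 + 97 = 2112
        // "BB": 31*'B' + 'B' = 31*66 + 66 = 2112
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
    }
}
```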
The author also demonstrated that there are many odd, short
String values that generate
String.hashCode() collisions. (The article gives many examples.) But because
String.hashCode()’s hash space is small and it runs so fast, it is easy to find examples of collisions, so being able to find them doesn’t make
String.hashCode() bad either. (And hopefully no one expected it to be a cryptographic hash function to begin with!)
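Just how easy is it? A brute-force scan over short strings turns up a collision almost immediately. This is a quick sketch of my own (the class and method names are arbitrary), not code from the article: over a 58-character alphabet, two-character hashes of the form 31*c0 + c1 overlap as soon as the alphabet is wider than 31 characters.

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionFinder {
    // Scan two-character strings over the chars 'A'..'z' and return the
    // first pair whose hashCode() values match, or null if none is found.
    static String[] findCollision() {
        Map<Integer, String> seen = new HashMap<>();
        for (char a = 'A'; a <= 'z'; a++) {
            for (char b = 'A'; b <= 'z'; b++) {
                String s = "" + a + b;
                String prev = seen.putIfAbsent(s.hashCode(), s);
                if (prev != null) {
                    return new String[] { prev, s };
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] pair = findCollision();
        // With 58 possible chars per position but only a factor of 31
        // between positions, a collision surfaces within a few dozen tries.
        System.out.println("\"" + pair[0] + "\" and \"" + pair[1]
                + "\" both hash to " + pair[0].hashCode());
    }
}
```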
So none of these arguments that
String.hashCode() is a “poor” hash function are convincing. Additionally, I have always found (anecdotally) that
String.hashCode() manages collisions quite well for Real World Data.
So what does a “poor” hash function look like? And what does a “good” hash function look like? And where does
String.hashCode() fall on that spectrum?