String.hashCode() is plenty unique - data, code and conversation

I’ve been running across this article all over Reddit for the past couple of days.

It’s driving me crazy.

The article (rightfully) points out that Java’s humble String.hashCode() method — which maps arbitrary-length String objects to 32-bit int values — has collisions. The article also (wrongfully) makes this sound surprising, and claims that the String.hashCode() algorithm is bad on that basis. In the author’s own words:

No matter what hashing strategy is used, collisions are enevitable (sic) however some hashes are worse than others. You can expect String to be fairly poor.

That’s pretty strongly worded!

The author has demonstrated that String.hashCode() has collisions. However, being a 32-bit String hash function, String.hashCode() will by its nature have many collisions, so the existence of collisions alone doesn’t make it bad.

The author also demonstrated that there are many odd, short String values that generate String.hashCode() collisions. (The article gives many examples, such as !~ and "_.) But because String.hashCode()‘s hash space is small and it runs so fast, it will even be easy to find examples of collisions, so being able to find the collisions doesn’t make String.hashCode() bad either. (And hopefully no one expected it to be a cryptographic hash function to begin with!)

So none of these arguments that String.hashCode() is a “poor” hash function are convincing. Additionally, I have always found (anecdotally) that String.hashCode() manages collisions quite well for Real World Data.

So what does a “poor” hash function look like? And what does a “good” hash function look like? And where does String.hashCode() fall on that spectrum?

A Bit of Theory

A 32-bit hash can only take 2^32 = 4,294,967,296 unique values. Because a String can have any number of characters in it, there are obviously more possible Strings than this. Therefore, collisions must exist because of the pigeonhole principle.

But what is the likelihood of a collision?

First, assume an “ideal” hash function is being analyzed. An ideal hash function distributes its outputs uniformly and independently across the hash space. In other words, for all possible inputs, an ideal hash function’s outputs should have no pattern, even if the inputs do.

The famous — and counter-intuitive! — birthday problem states that for 365 possible “hash values,” only 23 unique hashes must be computed before there is a 50% chance of a hash collision, even for ideal hash functions. If there are 2^32 possible hash values, roughly 77,164 unique hashes must be computed before there is a 50% chance of a hash collision, per this approximation:

from math import exp
def prob(x):
    return 1.0 -exp(-0.5 * x * (x-1) / 2**32)
prob(77163) # 0.4999978150170551
prob(77164) # 0.500006797931095

Given this as context, for a given hash space size n and computed hashes d, how many collisions are expected for an ideal hash function? Wikipedia provides a clean closed-form equation:

def count(d, n):
    return n - d + d * ((d - 1) / d)**n

This is a fine benchmark for collision performance. Now, how does String.hashCode() perform compared to a theoretical ideal hash function?

A Bit of Practice

How does String.hashCode() measure up? This test harness should run experiments nicely:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;

import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

import java.nio.charset.StandardCharsets;

public class HashTest {
    private static  Map<Integer,Set> collisions(Collection values) {
        Map<Integer,Set> result=new HashMap<>();
        for(T value : values) {
            Integer hc=Integer.valueOf(value.hashCode());
    
            Set bucket=result.get(hc);
            if(bucket == null)
                result.put(hc, bucket = new TreeSet<>());
    
            bucket.add(value);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.err.println("Loading lines from stdin...");
        Set lines=new HashSet<>();
        try (BufferedReader r=new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            for(String line=r.readLine();line!=null;line=r.readLine())
                lines.add(line);
        }

        // Warm up, if you please
        System.err.print("Warming up");
        for(int i=0;i<10;i++) {
            System.err.print(".");
            collisions(lines);
        }
        System.err.println();

        System.err.println("Computing collisions...");
        long start=System.nanoTime();
        Map<Integer,Set> collisions=collisions(lines);
        long finish=System.nanoTime();
        long elapsed=finish-start;

        int maxhc=0, maxsize=0;
        for(Map.Entry<Integer,Set> e : collisions.entrySet()) {
            Integer hc=e.getKey();
            Set bucket=e.getValue();
            if(bucket.size() > maxsize) {
                maxhc = hc.intValue();
                maxsize = bucket.size();
            }
        }

        System.out.println("Elapsed time: "+elapsed+"ns");
        System.out.println("Total unique lines: "+lines.size());
        System.out.println("Time per hashcode: "+String.format("%.4f", 1.0*elapsed/lines.size())+"ns");
        System.out.println("Total unique hashcodes: "+collisions.size());
        System.out.println("Total collisions: "+(lines.size()-collisions.size()));
        System.out.println("Collision rate: "+String.format("%.8f", 1.0*(lines.size()-collisions.size())/lines.size()));
        if(maxsize != 0)
            System.out.println("Max collisions: "+maxsize+" "+collisions.get(maxhc));
    }
}

How does String.hashCode() perform for short English String values? Treating a series of short English words as input yields this output:

$ cat words.txt | java HashTest
Loading lines from stdin...
Warming up..........
Computing collisions...
Elapsed time: 49117411ns
Total unique lines: 466544
Time per hashcode: 105.2793ns
Total unique hashcodes: 466188
Total collisions: 356
Collision rate: 0.00076306
Max collisions: 3 [Jr, KS, L4]

On short inputs in English, 466,544 total hashes resulted in 356 collisions. An “ideal” hash function would generate an expected 25.33 collisions in the same circumstances. Therefore, String.hashCode() generates roughly 14.05x more collisions than an ideal hash function would for these inputs:

356.0 / 25.33 ≈ 14.05

An aggregate collision rate of 8 per 10,000 still isn’t bad in an absolute sense.

And how does String.hashCode() perform for long English String values? Treating each unique line in the complete works of Shakespeare produces this output:

$ cat shakespeare.txt | java HashTest
Loading lines from stdin...
Warming up..........
Computing collisions...
Elapsed time: 24106163ns
Total unique lines: 111385
Time per hashcode: 216.4220ns
Total unique hashcodes: 111384
Total collisions: 1
Collision rate: 0.00000897
Max collisions: 2 [    There's half a dozen sweets.,   PISANIO. He hath been search'd among the dead and living,]

On longer inputs in English, 111,385 total hash resulted in 1 collision. An ideal hash function would generate an expected 1.44 collisions over this data. String.hashCode()‘s performance is on par with an ideal hash function in this case:

1 / 1.44 ≈ 0.694

Less than 1 collision per 100,000 hashes is excellent in an absolute sense as well.

A Bit of Interpretation

Obviously, String.hashCode() isn’t unique, but it can’t be. For short values it’s within an order of magnitude of theoretical ideal average efficiency. For long values, it performs in line with an ideal theoretical solution.

Several people on Reddit and Hacker News have pointed out that this approach is not a rigorous statistical analysis. That’s true! And I don’t mean to present it as such. (The word “significant” did creep into one draft; I’ve eliminated it.) For an example of what a much more thorough analysis of hash function performance might look like, I encourage readers to check out the links in the Further Reading section.

However, hopefully this at least demonstrates that String.hashCode() is plenty unique for its intended purpose, which is spreading String values out across a hash table, and that its collision performance is at least “OK.”

A Bit of Futher Reading

If you found this interesting, I highly recommend you check out this answer on Stack Overflow. It goes into wonderful depth on hash function collision and wall-clock performance.

Also, /u/_INTER_ on Reddit pointed out that there is a JEP for improving Java hash functions!

EDIT: Upon reflection, I realized that what caused me to put this (simple) analysis together was the subjective claim that String.hashCode() performed “poorly.” In my experience, it works quite well for its intended purpose, and I wanted to show that. My first draft used hand-waving (at least) as much as the original article when trying to make that case. I hope introducing a more theoretical framework in this draft provides a more convincing argument that String.hashCode() at least isn’t “poor.”