String.hashCode() is plenty unique

I’ve been running across this article all over Reddit for the past couple of days.

It’s driving me crazy.

The article (rightfully) points out that Java’s humble String.hashCode() method — which maps arbitrary-length String objects to 32-bit int values — has collisions. The article also (wrongfully) makes this sound surprising, and claims that the String.hashCode() algorithm is bad on that basis. In the author’s own words:

No matter what hashing strategy is used, collisions are enevitable (sic) however some hashes are worse than others. You can expect String to be fairly poor.

That’s pretty strongly worded!

The author has demonstrated that String.hashCode() has collisions. However, being a 32-bit String hash function, String.hashCode() will by its nature have many collisions, so the existence of collisions alone doesn’t make it bad.
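To put a rough number on "many collisions", here is a minimal sketch of the standard birthday-style estimate. It assumes an idealized hash that spreads inputs uniformly over all 2^32 int values, which is an assumption for illustration and not a claim about how String.hashCode() behaves. The class and method names are hypothetical.

```java
public class BirthdayEstimate {
    // Expected number of colliding pairs when n distinct inputs are hashed
    // uniformly at random into 2^32 buckets: there are n*(n-1)/2 pairs,
    // and each pair collides with probability 1/2^32.
    static double expectedCollidingPairs(long n) {
        double buckets = Math.pow(2, 32);
        return (double) n * (n - 1) / 2.0 / buckets;
    }

    public static void main(String[] args) {
        for (long n : new long[] {10_000L, 100_000L, 1_000_000L}) {
            System.out.printf("%,d strings -> ~%.2f colliding pairs%n",
                    n, expectedCollidingPairs(n));
        }
    }
}
```

Under that assumption, even a perfectly uniform 32-bit hash is expected to produce about one colliding pair at around 100,000 distinct inputs, and over a hundred at a million.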

The author also demonstrated that there are many odd, short String values that generate String.hashCode() collisions. (The article gives many examples, such as !~ and "_.) But because String.hashCode()’s hash space is small and it runs so fast, it will even be easy to find examples of collisions, so being able to find the collisions doesn’t make String.hashCode() bad either. (And hopefully no one expected it to be a cryptographic hash function to begin with!)
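The !~ and "_ collision is easy to check by hand. The sketch below re-implements the formula documented for String.hashCode() (each character added into a running total that is multiplied by 31) and compares it with the built-in method. The class name and the polyHash helper are hypothetical, introduced only for illustration.

```java
public class CollisionDemo {
    // The documented String.hashCode() formula:
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], in 32-bit int arithmetic.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String a = "!~";   // '!' = 33, '~' = 126 -> 33*31 + 126 = 1149
        String b = "\"_";  // '"' = 34, '_' = 95  -> 34*31 + 95  = 1149
        System.out.println(a.hashCode());                 // prints 1149
        System.out.println(b.hashCode());                 // prints 1149
        System.out.println(polyHash(a) == a.hashCode());  // prints true
    }
}
```

For two-character strings, raising the first character by 1 and lowering the second by 31 leaves the hash unchanged, which is why short collisions like this one are so easy to find.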

So none of these arguments that String.hashCode() is a “poor” hash function are convincing. Additionally, I have always found (anecdotally) that String.hashCode() manages collisions quite well for Real World Data.

So what does a “poor” hash function look like? And what does a “good” hash function look like? And where does String.hashCode() fall on that spectrum?

Continue reading “String.hashCode() is plenty unique”

Populating a EC2- and Fargate Cross-Compatible ECS Cluster from a Spot Fleet

When Amazon ECS was first released back in April 2015, it left a lot to be desired: tasks and services could only be run on a cluster you managed, clusters had limited support for autoscaling and spot instances, and so on. Amazon filled these gaps over the next couple of years with support for scaling policies, great blog posts on integrating spot fleets with ECS, and now even a wizard cluster builder for EC2-only clusters. But even while the tools improved, you were still managing cluster capacity, which is a pain. ECS really came into its own with the release of Fargate, which allows you to run ECS tasks on Amazon-managed “virtual capacity” in your cluster so you can finally stop counting servers.

While Fargate is great, it still costs more than running on Spot Fleets, and it can take a few minutes to start tasks. This isn’t a problem for most static workloads, but the delay in particular can be irritating for dynamic loads, especially when users are waiting for containers to start and stop. As a result, my team has started keeping a reasonably-sized EC2 Spot Fleet cluster warm, and then using Fargate for overflow. This provides the best possible user experience: most things start quickly and run cheaply, but nothing fails due to insufficient capacity.

The trick with this configuration, though, is you need to configure your Spot Fleet so that the same tasks can run with LaunchType set to EC2 or FARGATE. It’s not trivial; here are some of our lessons learned.

Continue reading “Populating a EC2- and Fargate Cross-Compatible ECS Cluster from a Spot Fleet”

A Year of Mashable — 2012

At work I finally got around to doing a project I’ve been wanting to do for a long time: analyze the sharing behavior of a year’s worth of content at Mashable.

It’s no small project. First, a year’s worth of Mashable content must be collected, which ended up being 13,979 articles in total. Next, the author, publish date, headline, and full text of each post must be extracted from each page, which requires a (fortunately simple) custom scraper to be built. Next, the social resonance data of each article must be collected. For this analysis, I collected share counts for Twitter, Facebook, StumbleUpon, LinkedIn, Google+, and Pinterest, plus clicks from Bitly and per-article submissions from Reddit.

Continue reading “A Year of Mashable — 2012”

Should I Use an ORM or Not? Sure.

There are a whole lot of strong opinions about ORM floating around the internet and elsewhere. When you see so many passionate, conflicting opinions in so many different threads, it’s a pretty clear sign you’re looking at a religious argument rather than a rational debate. And, as in any good religious argument — big endian or little endian, butter side up or butter side down, vi or emacs, Team Jacob or Team Edward — this one has two sides, too.

Still a Better Love Story than Twilight.

Continue reading “Should I Use an ORM or Not? Sure.”

Complication is What Happens When You Try to Solve a Problem You Don’t Understand

Code should be simple. Code should be butt simple. Code should be so simple that there’s no way it can be misunderstood. Good code has no nooks. Good code has no crannies. Good code is a round room with no corners for bugs to hide in.

We all know this. So why does most code suck?

Because it’s written by people who don’t understand the problem they’re trying to solve.

Continue reading “Complication is What Happens When You Try to Solve a Problem You Don’t Understand”

Learning to Program

Eventually, every programmer blogs about how to become a better programmer. It seems to be the price of admission to the industry. Programmers are a vain lot, and every one of us likes to think he has a unique viewpoint to contribute with insightful advice and meaningful guidance. The reality is that the “learn how to program” post is cliché. There are so many that each new one is nothing more than an echo of some old, vaguely-remembered, proto-learn-how-to-program-post. No one should write another. There’s no point.

So obviously I’m going to write another.

Programming is Exactly Like This

Continue reading “Learning to Program”

The Origin of Perfect Software

In another post, I claimed that software can’t be written with no bugs at all. Well, it turns out that’s not quite true. What I should have said is that writing bug-free software is not possible within the constraints of most software businesses or open-source projects.

But that just doesn’t have the same pizazz, does it?

The trouble is that software businesses exist to make money, and open source projects exist to give developers interesting things to do and exposure. (Naturally, there are some exceptions in both camps, but if you imagine that’s always true, you won’t be too far off.) And if these are the goals you’re chasing — customers and money, or interesting problems and exposure — you don’t end up with perfect software. You go broke or get bored before you get there.

Continue reading “The Origin of Perfect Software”

The Economics of Perfect Software

Ask 100 CEOs of software companies if they want to ship software with bugs. What will they say? 50 won’t answer at all, saying something about how bugs are a huge problem in the industry that needs to be addressed; 40 will say “Of course not!” and promptly call their shark tank in preparation for a lawsuit; 9 will hang their heads and say “we can’t help it”; and that last 1 will look you straight in the eye and say “Absolutely.”

I have no idea what that last guy’s doing heading up a software company, because he studied economics.

Continue reading “The Economics of Perfect Software”