Uncategorized - data, code and conversation

2022-11-08

BigQuery RegExp for Matching Single Emoji

If anyone needs a regular expression for matching single Emoji in BigQuery/RE2, this is a good place to start:

(?:[\x{1F300}-\x{1F5FF}]|[\x{1F900}-\x{1F9FF}]|[\x{1F600}-\x{1F64F}]|[\x{1F680}-\x{1F6FF}]|[\x{2600}-\x{26FF}]\x{FE0F}?|[\x{2700}-\x{27BF}]\x{FE0F}?|\x{24C2}\x{FE0F}?|[\x{1F1E6}-\x{1F1FF}]{1,2}|[\x{1F170}\x{1F171}\x{1F17E}\x{1F17F}\x{1F18E}\x{1F191}-\x{1F19A}]\x{FE0F}?|[\x{0023}\x{002A}\x{0030}-\x{0039}]\x{FE0F}?\x{20E3}|[\x{2194}-\x{2199}\x{21A9}-\x{21AA}]\x{FE0F}?|[\x{2B05}-\x{2B07}\x{2B1B}\x{2B1C}\x{2B50}\x{2B55}]\x{FE0F}?|[\x{2934}\x{2935}]\x{FE0F}?|[\x{3297}\x{3299}]\x{FE0F}?|[\x{1F201}\x{1F202}\x{1F21A}\x{1F22F}\x{1F232}\x{1F23A}\x{1F250}\x{1F251}]\x{FE0F}?|[\x{203C}-\x{2049}]\x{FE0F}?|[\x{00A9}-\x{00AE}]\x{FE0F}?|[\x{2122}\x{2139}]\x{FE0F}?|\x{1F004}\x{FE0F}?|\x{1F0CF}\x{FE0F}?|[\x{231A}\x{231B}\x{2328}\x{23CF}\x{23E9}\x{23F3}\x{23F8}\x{23FA}]\x{FE0F}?)
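For a quick local sanity check outside BigQuery, here is a minimal sketch in Java, whose java.util.regex engine accepts the same \x{...} code point escapes. It is not RE2, so treat it only as an approximation of how the expression behaves in BigQuery. The class name, the helper findEmoji, and the sample text are made up for illustration, and only a handful of the alternatives from the expression above are included to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiRegexDemo {
    // Abbreviated subset of the alternatives in the expression above
    // (assumption: the remaining alternatives can be appended the same way).
    static final Pattern EMOJI = Pattern.compile(String.join("|",
            "[\\x{1F300}-\\x{1F5FF}]",           // symbols and pictographs
            "[\\x{1F900}-\\x{1F9FF}]",           // supplemental symbols
            "[\\x{1F600}-\\x{1F64F}]",           // emoticons
            "[\\x{1F680}-\\x{1F6FF}]",           // transport and map symbols
            "[\\x{2600}-\\x{26FF}]\\x{FE0F}?",   // misc symbols, optional variation selector
            "[\\x{1F1E6}-\\x{1F1FF}]{1,2}"));    // regional indicators (flags)

    // Hypothetical helper: returns every emoji match found in the text, in order.
    static List<String> findEmoji(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMOJI.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample text: a rocket (U+1F680) and a sun (U+2600 followed by U+FE0F).
        String text = "Launch \uD83D\uDE80 at 9 \u2600\uFE0F";
        System.out.println(findEmoji(text).size() + " emoji found: " + findEmoji(text));
    }
}
```

In BigQuery itself the full expression would go into one of the REGEXP functions as a raw string literal; the Java version above needs the backslashes doubled only because it is a Java string literal.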

Inspired by zly394/EmojiRegex on GitHub.

2022-08-26

Announcing My First Commercial API, arachn.io!

I’ve just released my first commercial API, arachn.io. It’s a simple thing, focusing on performing link unwinding and data extraction affordably at scale. It’s also dead simple to use.

Check out the Free Forever plan to try it out yourself!

2022-04-08 (updated 2022-04-15)

OpenAPI Generator Template Customization Example

The OpenAPI Generator is a wonderful bit of tech. It allows users to create an OpenAPI spec, and then generate client and server code from it in a variety of languages and platforms. I use it to generate DTOs for all my APIs, and to generate service interfaces to keep my client and server in sync. It also simplifies contract testing, if that’s your bag.

The generator is quite powerful out of the box, but it doesn’t do everything. Fortunately, it’s also extremely customizable, so if and when you find something that’s not supported out of the box, then you can make it work with minimal muss and fuss using template customization. Here’s how.

2021-01-04 (updated 2021-01-05)

A Dendrogram Layout in Gephi

As part of a side project, I ended up producing a large dendrogram. Roughly 22,000 items were clustered, which produced over 44,000 nodes! Plotting such a large dendrogram is tough.

After trying to convince some tools to render a dendrogram this large in a way that is useful, I decided to render it as a graph using Gephi instead. While Gephi is not incredibly intuitive to use and isn’t the most stable piece of software, it is the easiest way to render large graphs. Rendering the dendrogram as a graph throws away the dissimilarity aspect of the rendering, but for a dataset this large, it’s tough to parse that anyway.

To get the best layout possible, I ended up using Gephi’s timeline feature. It was a little tricky to get right, so I thought I’d document the process here in case it’s useful to others.

2018-08-10 (updated 2018-10-07)

String.hashCode() is plenty unique

I’ve been running across this article all over Reddit for the past couple of days.

It’s driving me crazy.

The article (rightfully) points out that Java’s humble String.hashCode() method — which maps arbitrary-length String objects to 32-bit int values — has collisions. The article also (wrongfully) makes this sound surprising, and claims that the String.hashCode() algorithm is bad on that basis. In the author’s own words:

No matter what hashing strategy is used, collisions are enevitable (sic) however some hashes are worse than others. You can expect String to be fairly poor.

That’s pretty strongly worded!

The author has demonstrated that String.hashCode() has collisions. However, being a 32-bit String hash function, String.hashCode() will by its nature have many collisions, so the existence of collisions alone doesn’t make it bad.
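To put a rough number on "many collisions", here is a back-of-the-envelope sketch using the standard birthday-problem estimate: among n distinct keys hashed uniformly into 2^32 buckets, about n(n-1)/2 divided by 2^32 pairs are expected to collide. The uniform spread is an idealizing assumption, not a measured property of String.hashCode(), and the class and method names are made up for illustration.

```java
class BirthdayBoundDemo {
    // Number of distinct values a 32-bit hash can take: 2^32.
    static final double HASH_SPACE = 4294967296.0;

    // Expected number of colliding pairs among n distinct keys, assuming an
    // idealized hash that spreads keys uniformly over the 32-bit space.
    static double expectedCollidingPairs(long n) {
        return (double) n * (n - 1) / 2.0 / HASH_SPACE;
    }

    public static void main(String[] args) {
        for (long n : new long[] {1_000L, 100_000L, 1_000_000L}) {
            System.out.printf("%,d keys: about %.2f colliding pairs expected%n",
                    n, expectedCollidingPairs(n));
        }
    }
}
```

Under that assumption, even an ideal 32-bit hash is expected to produce its first collision somewhere around a hundred thousand keys, so collisions by themselves say little about quality.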

The author also demonstrated that there are many odd, short String values that generate String.hashCode() collisions. (The article gives many examples, such as !~ and "_.) But because String.hashCode()’s hash space is small and it runs so fast, it will even be easy to find examples of collisions, so being able to find the collisions doesn’t make String.hashCode() bad either. (And hopefully no one expected it to be a cryptographic hash function to begin with!)
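The collision quoted above is easy to check by hand. The sketch below re-implements the formula documented for String.hashCode(), s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], and compares the two example strings. The class and method names are made up for illustration.

```java
class HashCollisionDemo {
    // Re-implements the documented String.hashCode() formula using Horner's
    // rule; int overflow wraps around, exactly as it does in the JDK.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String a = "!~";   // '!' is 33, '~' is 126: 33 * 31 + 126 = 1149
        String b = "\"_";  // '"' is 34, '_' is 95:  34 * 31 + 95  = 1149
        System.out.println(a.hashCode() + " " + b.hashCode());
        System.out.println(polynomialHash(a) == a.hashCode());
    }
}
```

With a multiplier of 31, raising the first character by one and lowering the second by 31 always cancels out, which is why short colliding pairs like this are so easy to construct.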

So none of these arguments that String.hashCode() is a “poor” hash function are convincing. Additionally, I have always found (anecdotally) that String.hashCode() manages collisions quite well for Real World Data.

So what does a “poor” hash function look like? And what does a “good” hash function look like? And where does String.hashCode() fall on that spectrum?


2018-04-30 (updated 2018-08-11)

Populating an EC2- and Fargate Cross-Compatible ECS Cluster from a Spot Fleet

When Amazon ECS was first released back in April 2015, it left a lot to be desired: tasks and services could only be run on a cluster you managed, clusters had limited support for autoscaling and spot instances, and so on. Amazon filled these gaps over the next couple of years with support for scaling policies, great blog posts on integrating spot fleets with ECS, and now even a wizard cluster builder for EC2-only clusters. But even while the tools improved, you were still managing cluster capacity, which is a pain. ECS really came into its own with the release of Fargate, which allows you to run ECS tasks on Amazon-managed “virtual capacity” in your cluster so you can finally stop counting servers.

While Fargate is great, it still costs more than running on Spot Fleets, and it can take a few minutes to start tasks. This isn’t a problem for most static workloads, but the delay in particular can be irritating for dynamic loads, especially when users are waiting for containers to start and stop. As a result, my team has started keeping a reasonably-sized EC2 Spot Fleet cluster warm, and then using Fargate for overflow. This provides the best possible user experience: most things start quickly and run cheaply, but nothing fails due to insufficient capacity.

The trick with this configuration, though, is that you need to configure your Spot Fleet so that the same tasks can run with LaunchType set to EC2 or FARGATE. It’s not trivial; here are some of our lessons learned.


2013-01-04 (updated 2017-12-03)

A Year of Mashable — 2012

At work I finally got around to doing a project I’ve been wanting to do for a long time: analyze the sharing behavior of a year’s worth of content at Mashable.

It’s no small project. First, a year’s worth of Mashable content must be collected, which ended up being 13,979 articles in total. Next, the author, publish date, headline, and full text of each post must be extracted from each page, which requires a (fortunately simple) custom scraper to be built. Next, the social resonance data of each article must be collected. For this analysis, I collected share counts for Twitter, Facebook, StumbleUpon, LinkedIn, Google+, and Pinterest, plus clicks from Bitly and per-article submissions from Reddit.

2012-08-11 (updated 2017-12-03)

Should I Use an ORM or Not? Sure.

There are a whole lot of strong opinions about ORM floating around the internet and elsewhere. When you see so many passionate, conflicting opinions in so many different threads, it’s a pretty clear sign you’re looking at a religious argument rather than a rational debate. And, as in any good religious argument — big endian or little endian, butter side up or butter side down, vi or emacs, Team Jacob or Team Edward — this one has two sides, too.

Still a Better Love Story than Twilight.


2012-05-04 (updated 2017-12-03)

Complication is What Happens When You Try to Solve a Problem You Don’t Understand

Code should be simple. Code should be butt simple. Code should be so simple that there’s no way it can be misunderstood. Good code has no nooks. Good code has no crannies. Good code is a round room with no corners for bugs to hide in.

We all know this. So why does most code suck?

Because it’s written by people who don’t understand the problem they’re trying to solve.


2012-04-12 (updated 2017-12-03)

Learning to Program

Eventually, every programmer blogs about how to become a better programmer. It seems to be the price of admission to the industry. Programmers are a vain lot, and every one of us likes to think he has a unique viewpoint to contribute with insightful advice and meaningful guidance. The reality is that the “learn how to program” post is cliché. There are so many that each new one is nothing more than an echo of some old, vaguely-remembered, proto-learn-how-to-program-post. No one should write another. There’s no point.

So obviously I’m going to write another.
