Community-Managed AWS Lambda Base Images for Java 17

I’m (finally) upgrading several projects to Java 17, the current LTS version. I like to deploy my Lambda functions as container images because it keeps the devops simple — everything is a container! — but there still isn’t an officially supported base image for Java 17, even though it’s been out for almost a year. So I made one and released it to the community. You can find the images on ECR Public Gallery and DockerHub, and the source code is on GitHub. If you’ve been waiting to adopt Java 17 until there was Lambda support, then these base images will let you get started.
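For a sense of what ends up running on top of one of these images, here’s a minimal handler of the kind you’d compile with Java 17 and build into the container. The names are illustrative, not taken from the project, and the record-based request/response types assume a runtime serializer with record support.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

/**
 * Minimal handler of the kind you'd package into a Java 17 container image.
 * The names here are illustrative; the records assume the runtime's JSON
 * serializer supports records (recent versions do).
 */
public class GreetingHandler
        implements RequestHandler<GreetingHandler.Request, GreetingHandler.Response> {
    // Java 17 records make tidy request/response DTOs
    public record Request(String name) {}
    public record Response(String message) {}

    @Override
    public Response handleRequest(Request request, Context context) {
        context.getLogger().log("Handling request for " + request.name());
        return new Response("Hello, " + request.name() + "!");
    }
}
```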

Continue reading “Community-Managed AWS Lambda Base Images for Java 17”

Litecene: Full-Text Search for Google BigQuery

I just released the first beta version of litecene, a Java library that implements a common boolean search syntax for full-text search, with its first transpiler for BigQuery.

Litecene makes Searching Text in BigQuery Easy!

Litecene Syntax

Litecene uses a simple, user-friendly syntax derived from Lucene query syntax, so it should be familiar to most users. It includes clauses for terms with wildcards, phrases with proximity, AND, OR, NOT, and grouping.

As an example, this might be a good Litecene query to identify social media posts talking about common ways people use their smartphones:

(smartphone OR "smart phone" OR iphone OR "apple phone" OR android OR "google phone" OR "windows phone" OR "phone app"~8) AND (call* OR dial* OR app OR surf* OR brows* OR camera* OR pic* OR selfie)

The syntax is documented in more detail here.
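To make the idea of a transpiler concrete, here is a toy, self-contained sketch of what “boolean search syntax to BigQuery predicate” means. This is not Litecene itself or its API — it only hand-rolls the term/OR/AND pieces — but the generated SQL uses BigQuery’s real REGEXP_CONTAINS function, which is roughly the shape of output you’d splice into a WHERE clause.

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Toy illustration (NOT litecene itself) of what a boolean-search-to-BigQuery
 * transpiler produces: each term becomes a case-insensitive, word-bounded
 * regex match, and the terms of one clause are OR'd together.
 */
public class ToyTranspiler {
    /** Turn one term into a BigQuery predicate on the given column. */
    static String term(String column, String term) {
        // REGEXP_CONTAINS is BigQuery standard SQL; (?i) makes the match
        // case-insensitive and \b anchors on word boundaries.
        return "REGEXP_CONTAINS(" + column + ", r'(?i)\\b" + term + "\\b')";
    }

    /** OR together a list of term predicates into one grouped clause. */
    static String anyOf(String column, List<String> terms) {
        return terms.stream()
                .map(t -> term(column, t))
                .collect(Collectors.joining(" OR ", "(", ")"));
    }

    public static void main(String[] args) {
        // Roughly the shape of SQL a query like
        //   (smartphone OR iphone) AND (call* OR app)
        // might turn into. Phrases, proximity, and NOT are omitted from this
        // toy, and litecene's actual output may differ.
        String sql = "SELECT * FROM posts WHERE "
                + anyOf("text", List.of("smartphone", "iphone"))
                + " AND "
                + anyOf("text", List.of("call\\w*", "app"));
        System.out.println(sql);
    }
}
```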

Continue reading “Litecene: Full-Text Search for Google BigQuery”

Fundamentals of Software Performance Analysis Part III — Optimizing Memory Allocation Performance

This is the third post in a three-post series covering the fundamentals of software performance analysis. You can find the introduction here. You can find Part II here. You can find the companion GitHub repository here.

Part II covered the process of using a profiler (VisualVM) to identify “hot spots” where programs spend time and then optimizing program code to improve wall clock performance.

This post will cover the process of using a profiler to identify memory allocation hot spots and then optimizing program code to improve memory allocation performance. It might be useful to refer back to Part II if you need a refresher on how to use VisualVM or the optimization workflow.
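The post’s own examples live in the companion repository; purely as an illustration of the kind of change allocation profiling tends to surface, here is a classic one — an inner loop that boxes a value on every iteration versus one that sticks to primitives.

```java
/**
 * Illustration only (not code from the companion repository): a textbook
 * allocation hot spot and its fix.
 */
public class AllocationExample {
    // Before: boxes the accumulator, allocating a java.lang.Long on (almost)
    // every iteration. An allocation profiler shows this method dominating.
    static long sumBoxed(long n) {
        Long total = 0L;
        for (long i = 0; i < n; i++) {
            total += i; // unbox, add, re-box
        }
        return total;
    }

    // After: primitive accumulator, zero allocation inside the loop.
    static long sumPrimitive(long n) {
        long total = 0;
        for (long i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        // Same answer, but the boxed version allocated roughly a million
        // Long objects along the way.
        System.out.println(sumBoxed(n) == sumPrimitive(n));
    }
}
```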

Continue reading “Fundamentals of Software Performance Analysis Part III — Optimizing Memory Allocation Performance”

Fundamentals of Software Optimization Part II — Optimizing Wall Clock Performance

This is the second post in a three-post series covering the fundamentals of software optimization. You can find the introduction here. You can find Part I here. You can find Part III here. You can find the companion GitHub repository here.

Part I covered the development of the benchmark, which is the “meter stick” for measuring performance, and established baseline performance using the benchmark.

This post will cover the high-level optimization process, including how to profile software performance using VisualVM to identify “hot spots” in the code, make code changes to improve hot spot performance, and evaluate performance changes using a benchmark.
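The series works through its own hot spots in the companion repository; as a generic illustration of the pattern (profile, find the hot method, make it do less work), consider the classic case of repeated String concatenation in a loop.

```java
import java.util.Collections;
import java.util.List;

/**
 * Illustration only (not the series' code): the shape of a typical wall clock
 * optimization once a profiler has pointed at the hot method.
 */
public class HotSpotExample {
    // Before: String is immutable, so each += copies everything built so far,
    // making this quadratic in the number of parts. A CPU profile shows the
    // time disappearing into String and array-copy methods.
    static String joinSlow(List<String> parts) {
        String result = "";
        for (String part : parts) {
            result += part + ",";
        }
        return result;
    }

    // After: a single growable buffer makes the same work linear.
    static String joinFast(List<String> parts) {
        StringBuilder result = new StringBuilder();
        for (String part : parts) {
            result.append(part).append(',');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<String> parts = Collections.nCopies(10_000, "x");
        // Same output either way; only the time spent differs.
        System.out.println(joinSlow(parts).equals(joinFast(parts)));
    }
}
```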

Continue reading “Fundamentals of Software Optimization Part II — Optimizing Wall Clock Performance”

Fundamentals of Software Optimization Part I — Benchmarking

This is the first post in a three-post series covering the fundamentals of software optimization. You can find the introduction here. You can find Part II here. You can find the companion GitHub repository here.

The introduction motivated why software optimization is a problem that matters, reflected on the fundamental connection between the scientific method and software performance analysis, and documented the (informal) optimization goal for this series: to optimize the production workflow’s wall clock performance and memory usage performance “a lot.”

This post will cover the theory and practice of designing, building, and running a benchmark to measure program performance using JMH and establishing the benchmark’s baseline performance measurements.
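Without reproducing the series’ benchmark, a minimal JMH benchmark has roughly this shape; the workload being measured here is just a stand-in.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

/**
 * Minimal JMH benchmark sketch; the summing workload is a stand-in, not the
 * series' actual benchmark.
 */
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class ExampleBenchmark {
    private int[] data;

    @Setup
    public void setup() {
        // Prepare inputs outside the measured region.
        data = new Random(42).ints(10_000).toArray();
    }

    @Benchmark
    public void sum(Blackhole blackhole) {
        long total = 0;
        for (int value : data) {
            total += value;
        }
        // Hand the result to JMH so the JIT can't eliminate the loop.
        blackhole.consume(total);
    }
}
```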

Continue reading “Fundamentals of Software Optimization Part I — Benchmarking”

Introduction to Fundamentals of Software Optimization

This is the introduction to a three-post series covering the fundamentals of software optimization. You can find Part I here. You can find the companion GitHub repository here.

Performance is a major topic in software engineering. A quick Google search for “performance” in GitHub issues comes up with about a million results. Everyone wants software to go fast – especially the users!

Gotta Go Fast!

However, as a general problem, software optimization isn’t easy or intuitive. It turns out that software performance follows the Pareto principle — 90% of time is spent in 10% of code. It also turns out that in a large program, people — even professional software performance analysts who have spent their careers optimizing software — are really bad at guessing which 10% that is. So folks who try to make code go faster by guessing where the code is spending its time are actually much more likely to make things worse than better.

Fortunately, excellent tools exist to help people improve software performance these days, and many of them are free. This three-part series will explore these tools and how to apply them to optimize software performance quickly and reliably.

Continue reading “Introduction to Fundamentals of Software Optimization”

OpenAPI Generator Template Customization Example

The OpenAPI Generator is a wonderful bit of tech. It allows users to create an OpenAPI spec, and then generate client and server code from it in a variety of languages and platforms. I use it to generate DTOs for all my APIs, and to generate service interfaces to keep my client and server in sync. It also simplifies contract testing, if that’s your bag.

Contract testing is my bag, baby.

The generator is quite powerful out of the box, but it doesn’t do everything. Fortunately, it’s also extremely customizable, so if and when you find something that’s not supported out of the box, then you can make it work with minimal muss and fuss using template customization. Here’s how.
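The full walkthrough is in the post, but the gist of template customization is: copy the generator’s built-in Mustache templates, tweak the copies, and point the generator at your directory (the CLI’s --template-dir option, or the Maven plugin’s templateDirectory setting). If you happen to drive the generator from Java, the equivalent looks roughly like this; the paths are placeholders.

```java
import java.io.File;
import java.util.List;

import org.openapitools.codegen.DefaultGenerator;
import org.openapitools.codegen.config.CodegenConfigurator;

/**
 * Sketch of invoking OpenAPI Generator programmatically with a custom template
 * directory. Paths are placeholders; the same effect comes from --template-dir
 * on the CLI or templateDirectory in the Maven plugin.
 */
public class GenerateWithCustomTemplates {
    public static void main(String[] args) {
        CodegenConfigurator configurator = new CodegenConfigurator();
        configurator.setGeneratorName("java");                 // which built-in generator to run
        configurator.setInputSpec("api/openapi.yaml");         // placeholder path to the spec
        configurator.setOutputDir("target/generated-sources"); // where generated code lands
        // Overridden copies of the generator's Mustache templates; any template
        // not present here falls back to the built-in version.
        configurator.setTemplateDir("src/main/resources/openapi-templates");

        List<File> generated = new DefaultGenerator()
                .opts(configurator.toClientOptInput())
                .generate();

        System.out.println("Generated " + generated.size() + " files");
    }
}
```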

Continue reading “OpenAPI Generator Template Customization Example”

A Dendrogram Layout in Gephi

As part of a side project, I ended up producing a large dendrogram. Roughly 22,000 items were clustered, which produced over 44,000 nodes! Plotting such a large dendrogram is tough.

After trying to convince some tools to render a dendrogram this large in a way that is useful, I decided to render it as a graph using Gephi instead. While Gephi is not incredibly intuitive to use and isn’t the most stable piece of software, it is the easiest way to render large graphs. Rendering the dendrogram as a graph throws away the dissimilarity aspect of the rendering, but for a dataset this large, it’s tough to parse that anyway.
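Gephi doesn’t consume dendrograms directly, so the first step is flattening the merge tree into a plain edge list it can import. Here is a minimal sketch of that conversion; the merge-list format is an assumption (leaves numbered 0..n-1, each merge creating a new internal node from two earlier nodes), so adapt it to whatever your clustering tool emits.

```java
import java.io.IOException;
import java.io.PrintWriter;

/**
 * Sketch: flatten a dendrogram's merge tree into an edge CSV that Gephi can
 * import. Assumed format: leaves are numbered 0..n-1, and merge i creates
 * internal node n+i by joining two earlier nodes.
 */
public class DendrogramToEdges {
    public static void main(String[] args) throws IOException {
        int leaves = 4;
        // Each row is one merge: {left child, right child}.
        int[][] merges = { {0, 1}, {2, 3}, {4, 5} };

        try (PrintWriter out = new PrintWriter("edges.csv")) {
            out.println("Source,Target"); // header Gephi's spreadsheet importer expects
            for (int i = 0; i < merges.length; i++) {
                int parent = leaves + i; // id of the new internal node
                out.println(parent + "," + merges[i][0]);
                out.println(parent + "," + merges[i][1]);
            }
        }
        // Load edges.csv in Gephi via "Import spreadsheet" as an edge table.
    }
}
```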

To get the best layout possible, I ended up using Gephi’s timeline feature. It was a little tricky to get right, so I thought I’d document the process here in case it’s useful to others.

Continue reading “A Dendrogram Layout in Gephi”

String.hashCode() is plenty unique

I’ve been running across this article all over Reddit for the past couple of days.

It’s driving me crazy.

The article (rightfully) points out that Java’s humble String.hashCode() method — which maps arbitrary-length String objects to 32-bit int values — has collisions. The article also (wrongfully) makes this sound surprising, and claims that the String.hashCode() algorithm is bad on that basis. In the author’s own words:

No matter what hashing strategy is used, collisions are enevitable (sic) however some hashes are worse than others. You can expect String to be fairly poor.

That’s pretty strongly worded!

The author has demonstrated that String.hashCode() has collisions. However, being a 32-bit String hash function, String.hashCode() will by its nature have many collisions (by the birthday bound, a collision is more likely than not among just ~77,000 random strings), so the existence of collisions alone doesn’t make it bad.

The author also demonstrated that there are many odd, short String values that generate String.hashCode() collisions. (The article gives many examples, such as !~ and "_.) But because String.hashCode()’s hash space is small and the function runs so fast, it’s also easy to find examples of collisions, so being able to find collisions doesn’t make String.hashCode() bad either. (And hopefully no one expected it to be a cryptographic hash function to begin with!)
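Both points are easy to check yourself: the hash is just the small, documented polynomial over the characters, and the article’s two-character examples really do collide.

```java
/**
 * Check the article's collision examples and show why they work:
 * String.hashCode() is documented as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1].
 */
public class HashCodeCollision {
    // Same polynomial String.hashCode() is documented to use.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // '!' = 33, '~' = 126  ->  33*31 + 126 = 1149
        // '"' = 34, '_' =  95  ->  34*31 +  95 = 1149
        System.out.println("!~".hashCode());                 // 1149
        System.out.println("\"_".hashCode());                // 1149
        System.out.println(hash("!~") == "!~".hashCode());   // true
    }
}
```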

So none of these arguments that String.hashCode() is a “poor” hash function are convincing. Additionally, I have always found (anecdotally) that String.hashCode() manages collisions quite well for Real World Data.

So what does a “poor” hash function look like? And what does a “good” hash function look like? And where does String.hashCode() fall on that spectrum?

Continue reading “String.hashCode() is plenty unique”