BigQuery RegExp for Matching Single Emoji

If anyone needs a regular expression for matching single Emoji in BigQuery/RE2, this is a good place to start:

(?:[\x{1F300}-\x{1F5FF}]|[\x{1F900}-\x{1F9FF}]|[\x{1F600}-\x{1F64F}]|[\x{1F680}-\x{1F6FF}]|[\x{2600}-\x{26FF}]\x{FE0F}?|[\x{2700}-\x{27BF}]\x{FE0F}?|\x{24C2}\x{FE0F}?|[\x{1F1E6}-\x{1F1FF}]{1,2}|[\x{1F170}\x{1F171}\x{1F17E}\x{1F17F}\x{1F18E}\x{1F191}-\x{1F19A}]\x{FE0F}?|[\x{0023}\x{002A}\x{0030}-\x{0039}]\x{FE0F}?\x{20E3}|[\x{2194}-\x{2199}\x{21A9}-\x{21AA}]\x{FE0F}?|[\x{2B05}-\x{2B07}\x{2B1B}\x{2B1C}\x{2B50}\x{2B55}]\x{FE0F}?|[\x{2934}\x{2935}]\x{FE0F}?|[\x{3297}\x{3299}]\x{FE0F}?|[\x{1F201}\x{1F202}\x{1F21A}\x{1F22F}\x{1F232}\x{1F23A}\x{1F250}\x{1F251}]\x{FE0F}?|[\x{203C}-\x{2049}]\x{FE0F}?|[\x{00A9}-\x{00AE}]\x{FE0F}?|[\x{2122}\x{2139}]\x{FE0F}?|\x{1F004}\x{FE0F}?|\x{1F0CF}\x{FE0F}?|[\x{231A}\x{231B}\x{2328}\x{23CF}\x{23E9}\x{23F3}\x{23F8}\x{23FA}]\x{FE0F}?)

Inspired by zly394/EmojiRegex on GitHub.
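If you want to sanity-check the expression locally before running it in BigQuery, here is a minimal sketch in Java. It is not RE2: it relies on the assumption that `java.util.regex` and RE2 treat the `\x{...}` code point escapes the same way for this pattern, and it uses only an abbreviated subset of the alternatives above. The class and method names (`EmojiRegexSketch`, `findEmoji`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmojiRegexSketch {
    // Abbreviated subset of the full expression: two pictograph blocks,
    // miscellaneous symbols with an optional variation selector,
    // regional-indicator (flag) pairs, and keycap sequences.
    static final Pattern EMOJI = Pattern.compile(
        "(?:[\\x{1F300}-\\x{1F5FF}]"
        + "|[\\x{1F600}-\\x{1F64F}]"
        + "|[\\x{2600}-\\x{26FF}]\\x{FE0F}?"
        + "|[\\x{1F1E6}-\\x{1F1FF}]{1,2}"
        + "|[\\x{0023}\\x{002A}\\x{0030}-\\x{0039}]\\x{FE0F}?\\x{20E3})");

    // Returns every non-overlapping match, in order of appearance.
    static List<String> findEmoji(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMOJI.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sun + variation selector, grinning face, US flag, keycap 1.
        String text = "Sunny \u2600\uFE0F today, grinning \uD83D\uDE00, "
            + "flag \uD83C\uDDFA\uD83C\uDDF8, keycap 1\uFE0F\u20E3";
        for (String emoji : findEmoji(text)) {
            System.out.println(emoji);
        }
    }
}
```

The sample string yields four matches: the flag counts as one match because the two regional indicators are consumed together by `{1,2}`, and the keycap counts as one because the digit, variation selector, and combining keycap are matched as a sequence. In BigQuery itself you would pass the full expression to a function such as `REGEXP_EXTRACT_ALL` or `REGEXP_CONTAINS` instead.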

Community-Managed AWS Lambda Base Images for Java 18

I’ve added a new custom base image for Java 18 on Lambda to complement the community Java 17 base images already available. You can find the images on the ECR Public Gallery and DockerHub and the source code on GitHub. If you’ve been waiting to adopt Java 18 until there was Lambda support, then these base images will let you get started.

Announcing My First Commercial API, arachn.io!

I’ve just released my first commercial API, arachn.io. It’s a simple thing, focusing on performing link unwinding and data extraction affordably at scale. It’s also dead simple to use.

Example arachn.io usage

Check out the Free Forever plan to try it out yourself!

Community-Managed AWS Lambda Base Images for Java 17

I’m (finally) upgrading several projects to Java 17, the current LTS version. I like to deploy my Lambda functions as container images because it keeps the devops simple (everything is a container!), but there still isn’t an officially supported base image for Java 17, even though it’s been out for almost a year. So I made one and released it to the community. You can find the images on ECR Public Gallery and DockerHub, and the source code is on GitHub. If you’ve been waiting to adopt Java 17 until there was Lambda support, then these base images will let you get started.

Litecene: Full-Text Search for Google BigQuery

I just released the first beta version of litecene, a Java library that implements a common boolean search syntax for full-text search, with its first transpiler for BigQuery.

Litecene makes Searching Text in BigQuery Easy!

Litecene Syntax

Litecene uses a simple, user-friendly syntax derived from Lucene query syntax, so it should be familiar to most users. It includes clauses for terms with wildcards, phrases with proximity, AND, OR, NOT, and grouping.

As an example, this might be a good Litecene query to identify social media posts talking about common ways people use their smartphones:

(smartphone OR "smart phone" OR iphone OR "apple phone" OR android OR "google phone" OR "windows phone" OR "phone app"~8) AND (call* OR dial* OR app OR surf* OR brows* OR camera* OR pic* OR selfie)

The syntax is documented in more detail here.
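To make the semantics of those clauses concrete, here is a minimal sketch in plain Java that evaluates a cut-down version of the query above against a single piece of text. It is not litecene's API or the SQL it generates. It assumes case-insensitive matching on word tokens, treats a trailing `*` as a prefix match, treats a quoted phrase as consecutive tokens, and leaves out proximity (`~8`) and NOT entirely. All names (`BooleanSearchSketch`, `term`, `phrase`, `matches`) are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class BooleanSearchSketch {
    // Split text into lowercase word tokens (assumption: ASCII letters and digits only).
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
    }

    // A plain term must equal a whole token; a trailing * matches any token
    // that starts with the given prefix.
    static boolean term(List<String> tokens, String term) {
        if (term.endsWith("*")) {
            String prefix = term.substring(0, term.length() - 1);
            return tokens.stream().anyMatch(t -> t.startsWith(prefix));
        }
        return tokens.contains(term);
    }

    // A phrase matches when its words appear as consecutive tokens.
    static boolean phrase(List<String> tokens, String... words) {
        return Collections.indexOfSubList(tokens, Arrays.asList(words)) >= 0;
    }

    // Cut-down query:
    // (smartphone OR "smart phone" OR iphone) AND (call* OR camera* OR selfie)
    static boolean matches(String text) {
        List<String> t = tokenize(text);
        boolean device = term(t, "smartphone")
            || phrase(t, "smart", "phone")
            || term(t, "iphone");
        boolean activity = term(t, "call*")
            || term(t, "camera*")
            || term(t, "selfie");
        return device && activity;
    }

    public static void main(String[] args) {
        System.out.println(matches("Calling mom on my new iPhone"));
        System.out.println(matches("My smart phone takes great pictures"));
    }
}
```

The first sample matches because `iphone` satisfies the device group and `call*` matches "calling"; the second does not, because nothing in it satisfies the activity group even though the phrase "smart phone" is present.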

Fundamentals of Software Performance Analysis Part III — Optimizing Memory Allocation Performance

This is the third post in a three-post series covering the fundamentals of software performance analysis. You can find the introduction here. You can find Part II here. You can find the companion GitHub repository here.

Part II covered the process of using a profiler (VisualVM) to identify “hot spots” where programs spend time and then optimizing program code to improve program wall clock performance.

This post will cover the process of using a profiler to identify memory allocation hot spots and then optimizing program code to improve memory allocation performance. It might be useful to refer back to Part II if you need a refresher on how to use VisualVM or the optimization workflow.

Fundamentals of Software Optimization Part II — Optimizing Wall Clock Performance

This is the second post in a three-post series covering the fundamentals of software optimization. You can find the introduction here. You can find Part I here. You can find Part III here. You can find the companion GitHub repository here.

Part I covered the development of the benchmark, which is the “meter stick” for measuring performance, and established baseline performance using the benchmark.

This post will cover the high-level optimization process, including how to profile software performance using VisualVM to identify “hot spots” in the code, make code changes to improve hot spot performance, and evaluate performance changes using a benchmark.

Fundamentals of Software Optimization Part I — Benchmarking

This is the first post in a three-post series covering the fundamentals of software optimization. You can find the introduction here. You can find Part II here. You can find the companion GitHub repository here.

The introduction explained why software optimization is a problem that matters, reflected on the fundamental connection between the scientific method and software performance analysis, and documented the (informal) optimization goal for this series: to optimize the production workflow’s wall clock performance and memory usage performance “a lot.”

This post will cover the theory and practice of designing, building, and running a benchmark to measure program performance using JMH and establishing the benchmark’s baseline performance measurements.
