Fundamentals of Software Optimization Part I — Benchmarking

This is the first post in a three-post series covering the fundamentals of software optimization. You can find the introduction here, Part II here, and the companion GitHub repository here.

The introduction motivated why software optimization is a problem that matters, reflected on the fundamental connection between the scientific method and software performance analysis, and documented the (informal) optimization goal for this series: to optimize the production workflow’s wall clock performance and memory usage performance “a lot.”

This post covers the theory and practice of designing, building, and running a benchmark to measure program performance using JMH, and of establishing the benchmark’s baseline performance measurements.

The Theory of Benchmarking

Software optimization is the process of changing program code to improve program performance along some dimension(s), e.g. wall clock time, perceived responsiveness, etc. The benchmark is the “meter stick” used to measure performance: a standardized workload plus performance metrics that together define objective measurements of the performance dimension(s) being optimized. It is this “definition of performance” that determines whether a code change has made performance better or worse.

The first measurements generated by the benchmark are the baseline performance measurements of the system. All future performance measurements will be judged relative to this initial performance record.

The benchmark workload should match the production workload being optimized as closely as possible. Otherwise, program changes that result in performance gains in the benchmark may not result in performance gains in production.

In science, researchers change one variable in a system at a time and then observe how changing that variable affects the system’s behavior in order to understand, and ultimately manipulate, the system. The same design ethos applies to benchmarking: the fundamental principle of variable isolation will drive the design and execution of the benchmark throughout the rest of this discussion.

The Goal

Recall from the introduction that the production workload and optimization goal are as follows:

Production Workload

The production workload is the following API endpoint:

  /**
   * @param text Text from a social media post
   * @return The number of emoji in the given text
   */
  @GET
  public Integer countEmoji(String text) {
    int count = 0;

    GraphemeMatcher m = new GraphemeMatcher(text);
    while (m.find()) {
      count = count + 1;
    }

    return count;
  }

Optimization Goal

The optimization goal is: to optimize the production workload’s wall clock performance and memory usage performance “a lot.” (Normally, the optimization goal would include a precise performance target and a connection to some business value, but for this series the goal is kept intentionally simple.)

Building the Benchmark

This series will use the excellent JMH project to build the benchmark. JMH was chosen because it is a battle-tested benchmarking harness developed by the core Java team that supports all the major benchmarking best practices (multiple forks, warmup period, setup phase, blackholes, etc.) out of the box.

Recall that the primary goal when designing the benchmark workload is to make it as similar to the production workload as possible. Given the production workload, a reasonable benchmark workload would be:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPInputStream;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import com.google.common.io.ByteStreams;
import com.google.common.io.Resources;

// (Project-local imports, e.g. GraphemeMatcher and GraphemeTrie, omitted.)

@Fork(value = 3)                   // Run 3 executions in different processes
@Warmup(iterations = 5)            // In each fork, run 5 iterations to warm up
@Measurement(iterations = 5)       // In each fork, run 5 iterations to measure
@OutputTimeUnit(TimeUnit.SECONDS)  // Report the output time in seconds
@BenchmarkMode(Mode.Throughput)    // Our metric is throughput
@State(Scope.Benchmark)            // The initialization covers a whole fork
public class GraphemeMatcherBenchmark {
  /**
   * Contains exactly 1MB of "random" data sampled from Twitter streaming API. 
   * Visually confirmed to be emoji-rich.
   */
  public String text;

  /**
   * The data structure to use to scan for emoji
   */
  public GraphemeTrie trie;

  @Setup
  public void setupGraphemeMatcherBenchmark() throws IOException {
    // Load the text to process during our benchmark
    try (
        InputStream in = new GZIPInputStream(Resources.getResource("tweets.txt.gz").openStream())) {
      text = new String(ByteStreams.toByteArray(in), StandardCharsets.UTF_8);
    }

    // Load the data structure we need to scan for emoji
    trie = DefaultGraphemeTrie.fromGraphemeData(Graphemes.getGraphemeData());
  }

  @Benchmark
  public void tweets(Blackhole blackhole) {
    // Let's count our matches
    int count = 0;

    // For each emoji match, increment our count
    GraphemeMatcher m = new GraphemeMatcher(trie, text);
    while (m.find()) {
      count = count + 1;
    }

    // In Java, if work happens that does not result in externally-visible side
    // effects, then the JIT can optimize it out. JMH provides the Blackhole,
    // which can be used to generate a side effect that prevents the benchmark 
    // from being optimized away. This guarantees we're measuring what we think
    // we're measuring.
    blackhole.consume(count);
  }
}
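As an aside, this series runs the benchmark from a shaded jar (built below), but JMH benchmarks can also be launched programmatically, which is convenient when working from an IDE. Here is a minimal sketch, assuming the benchmark class above is on the classpath (BenchmarkMain is a hypothetical helper, not part of the companion repository):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkMain {
  public static void main(String[] args) throws RunnerException {
    // Select benchmarks to run by regex; the annotations on the benchmark
    // class (forks, warmup, etc.) still apply.
    Options options = new OptionsBuilder()
        .include(GraphemeMatcherBenchmark.class.getSimpleName())
        .build();
    new Runner(options).run();
  }
}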

JMH benchmarks have two phases: the setup phase and the workload phase. Only the workload phase is measured and reported as the performance measurements.

In the setup phase — so before measurement starts — the benchmark reads a bolus of social media posts and stores it in memory, and loads the data structure required to scan for emoji. Isolating this initialization work in the setup phase allows the production workload to be measured with as much accuracy as possible. Keeping the social media data in memory avoids adding disk I/O to the workload. Using social media posts — as opposed to newspaper articles, or Shakespearean sonnets — is part of keeping the benchmark workload as similar to the production workload as possible.

In the actual measured benchmark, the code is as similar to the production workload as possible. Using a blackhole is an important best practice that ensures the compiler and JIT don’t simply optimize the benchmark code away. A benchmark workload that does nothing goes very fast, but it also doesn’t help optimize the production workload!
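To make the hazard concrete, here is a sketch of the failure mode and one common alternative. Both methods are hypothetical variants of tweets(), shown as if they lived inside the same state class so they can reference the text and trie fields; neither is part of the companion repository.

  // BAD: count is never used, so the JIT is free to eliminate the entire
  // loop, and the benchmark would end up measuring an empty method.
  @Benchmark
  public void badTweets() {
    int count = 0;
    GraphemeMatcher m = new GraphemeMatcher(trie, text);
    while (m.find()) {
      count = count + 1;
    }
  }

  // OK: JMH implicitly consumes a benchmark method's return value, so
  // returning the count is an alternative to calling Blackhole.consume().
  @Benchmark
  public int returningTweets() {
    int count = 0;
    GraphemeMatcher m = new GraphemeMatcher(trie, text);
    while (m.find()) {
      count = count + 1;
    }
    return count;
  }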

The annotations on the benchmark class control how the benchmark is actually executed. This configuration executes the workload multiple times in multiple forks, using separate processes to maximize the isolation between benchmark runs and to evaluate the consistency of benchmark performance. It also includes a “warmup” period in each fork to allow performance to reach steady state (JIT compilation, cache residency, etc.) before measurement begins. Finally, it indicates that benchmark setup should run once per fork (as opposed to once per iteration, for example; see the sketch below), and that performance metrics should be reported in throughput mode, which matches the selected performance metrics.
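For reference, how often setup runs is controlled by the Level argument to @Setup. A minimal sketch of the available levels (SetupLevels and its method names are hypothetical):

import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class SetupLevels {
  @Setup(Level.Trial)      // Runs once per fork (the default, used above)
  public void oncePerFork() {}

  @Setup(Level.Iteration)  // Runs before each iteration
  public void oncePerIteration() {}

  @Setup(Level.Invocation) // Runs before every benchmark method call; rarely needed
  public void oncePerInvocation() {}
}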

Choosing Performance Metrics

Given the optimization goal, the selected performance metrics are:

  • Text throughput. This metric will measure how fast text is processed on the “wall clock time” dimension. Throughput is defined as units of work done per unit time, so the units are bytes per second, at some order of magnitude. Choosing a size of 1MB for the bolus of social media text processed in each benchmark iteration causes the benchmark to report the throughput measurement directly in MB/s (see the sketch after this list).
  • Memory allocation throughput. This metric will measure how much memory is allocated during text processing. It is tempting to choose a metric like “memory allocation per execution,” but that metric would be hard to measure and would change with the amount of text in each call, so it would not be stable. Happily, JMH exposes an option to report allocation in MB/s out of the box. This will be an accurate measurement of the given metric because the benchmark consists almost entirely of text processing.
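To make the unit conversion in the first metric concrete, here is a trivial sketch of the arithmetic (the 19.383 ops/s figure is taken from the results later in this post):

public class ThroughputUnits {
  public static void main(String[] args) {
    double mbPerOp = 1.0;       // each benchmark op processes exactly 1MB of text
    double opsPerSec = 19.383;  // example throughput as reported by JMH
    // With a 1MB workload, ops/s reads directly as MB/s.
    System.out.println(mbPerOp * opsPerSec + " MB/s"); // prints "19.383 MB/s"
  }
}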

Building the Environment

Recall that the primary goal when designing the benchmark workload is to make it as similar to the production workload as possible. This principle extends to the execution environment as well. (For example, it’s not hard to see how optimizing source code performance on an ARM chip may not translate to the same performance gains on an x86 chip!) The same argument applies to CPU differences, memory amount differences, memory speed differences, etc.

Other parts of the optimization workflow can be performed locally, but because benchmarks are the definition of performance, it’s important that they be run on an environment as close to the production environment as possible.

In a professional scenario, the software optimization plan would include detailed information about the specifications and configuration of the production environment. Because there is no production environment in this example case, a simple environment on a cloud server will be used instead. This approach allows readers to reproduce measurements on their own if they choose.

Hardware specs and software configuration both affect performance, so to guarantee consistency when reproducing benchmark outcomes, it is important to be able to reproduce the benchmark environment exactly. It’s therefore a best practice to create benchmark environments using some standardized process, whether a well-documented manual process, an automation platform like Chef, or an IaC tool like CloudFormation or Terraform.

To keep things simple, this series will use an AWS EC2 a1.medium server initialized with the following script:

#!/bin/bash
sudo apt-get update
sudo apt-get install -y openjdk-17-jre-headless

Preparing the Environment

To ensure that the benchmark is run the same way every time, the following script is used to run it:

#!/bin/bash
java -Xms1024m -Xmx1024m -jar foso-benchmark.jar -prof gc

The command line gives the benchmark the following instructions:

  • -Xms1024m — Use a minimum heap size of 1 GiB, or 1024 MiB
  • -Xmx1024m — Use a maximum heap size of 1 GiB, or 1024 MiB
  • -prof gc — Also profile GC usage, which will measure memory allocation throughput

Setting the minimum and maximum heap sizes to the same value guarantees that differences in heap growth between runs will not affect performance measurements. This would likely be handled adequately by the warmup iterations of the benchmark, but in benchmarking, whenever there is an opportunity to isolate a variable that could affect performance, take it!
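To double-check that the heap flags took effect, the configured maximum heap can be inspected from inside the JVM. A minimal sketch (HeapCheck is a hypothetical helper, not part of the companion repository; maxMemory() typically reports a value slightly under -Xmx):

public class HeapCheck {
  public static void main(String[] args) {
    // With -Xms1024m -Xmx1024m, this prints roughly 1024 MiB.
    long maxBytes = Runtime.getRuntime().maxMemory();
    System.out.printf("Max heap: %.0f MiB%n", maxBytes / (1024.0 * 1024.0));
  }
}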

Now that the benchmark has been written and the environment is ready, it’s time to build the benchmark and copy it to the benchmark server. From the root directory of the project:

$ mvn clean install
$ scp -i /path/to/key.pem foso-benchmark/target/foso-benchmark.jar ubuntu@ec2-hostname-goes-here.compute.amazonaws.com:/home/ubuntu
$ scp -i /path/to/key.pem run.sh ubuntu@ec2-hostname-goes-here.compute.amazonaws.com:/home/ubuntu

Running the Benchmark

It’s finally time to run the benchmark! Here are the results produced by run.sh:

# JMH version: 1.35
# VM version: JDK 17.0.2, OpenJDK 64-Bit Server VM, 17.0.2+8-Ubuntu-120.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-arm64/bin/java
# VM options: -Xms1024m -Xmx1024m
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets

# Run progress: 0.00% complete, ETA 00:05:00
# Fork: 1 of 3
# Warmup Iteration   1: 16.818 ops/s
# Warmup Iteration   2: 19.120 ops/s
# Warmup Iteration   3: 19.288 ops/s
# Warmup Iteration   4: 19.381 ops/s
# Warmup Iteration   5: 19.311 ops/s
Iteration   1: 18.127 ops/s
                 ·gc.alloc.rate:            47.152 MB/sec
                 ·gc.alloc.rate.norm:       2863877.758 B/op
                 ·gc.churn.Eden_Space:      51.804 MB/sec
                 ·gc.churn.Eden_Space.norm: 3146448.176 B/op
                 ·gc.count:                 2.000 counts
                 ·gc.time:                  15.000 ms

Iteration   2: 19.105 ops/s
                 ·gc.alloc.rate:            49.702 MB/sec
                 ·gc.alloc.rate.norm:       2863861.792 B/op
                 ·gc.churn.Eden_Space:      51.762 MB/sec
                 ·gc.churn.Eden_Space.norm: 2982570.667 B/op
                 ·gc.count:                 2.000 counts
                 ·gc.time:                  14.000 ms

Iteration   3: 19.178 ops/s
                 ·gc.alloc.rate:            49.882 MB/sec
                 ·gc.alloc.rate.norm:       2863861.792 B/op
                 ·gc.churn.Eden_Space:      51.949 MB/sec
                 ·gc.churn.Eden_Space.norm: 2982570.667 B/op
                 ·gc.count:                 2.000 counts
                 ·gc.time:                  16.000 ms

Iteration   4: 19.099 ops/s
                 ·gc.alloc.rate:                49.690 MB/sec
                 ·gc.alloc.rate.norm:           2864042.250 B/op
                 ·gc.churn.Eden_Space:          51.747 MB/sec
                 ·gc.churn.Eden_Space.norm:     2982570.667 B/op
                 ·gc.churn.Survivor_Space:      0.477 MB/sec
                 ·gc.churn.Survivor_Space.norm: 27482.958 B/op
                 ·gc.count:                     2.000 counts
                 ·gc.time:                      12.000 ms

Iteration   5: 19.343 ops/s
                 ·gc.alloc.rate:                50.314 MB/sec
                 ·gc.alloc.rate.norm:           2863859.711 B/op
                 ·gc.churn.Eden_Space:          51.859 MB/sec
                 ·gc.churn.Eden_Space.norm:     2951822.515 B/op
                 ·gc.churn.Survivor_Space:      ≈ 10⁻⁴ MB/sec
                 ·gc.churn.Survivor_Space.norm: 8.454 B/op
                 ·gc.count:                     2.000 counts
                 ·gc.time:                      1.000 ms


# Run progress: 33.33% complete, ETA 00:03:34
# Fork: 2 of 3
# Warmup Iteration   1: 17.525 ops/s
# Warmup Iteration   2: 18.185 ops/s
# Warmup Iteration   3: 18.432 ops/s
# Warmup Iteration   4: 18.333 ops/s
# Warmup Iteration   5: 18.626 ops/s
Iteration   1: 19.363 ops/s
                 ·gc.alloc.rate:            50.361 MB/sec
                 ·gc.alloc.rate.norm:       2863859.588 B/op
                 ·gc.churn.Eden_Space:      51.908 MB/sec
                 ·gc.churn.Eden_Space.norm: 2951822.515 B/op
                 ·gc.count:                 2.000 counts
                 ·gc.time:                  15.000 ms

Iteration   2: 19.016 ops/s
                 ·gc.alloc.rate:            49.470 MB/sec
                 ·gc.alloc.rate.norm:       2863863.246 B/op
                 ·gc.churn.Eden_Space:      51.790 MB/sec
                 ·gc.churn.Eden_Space.norm: 2998186.220 B/op
                 ·gc.count:                 2.000 counts
                 ·gc.time:                  16.000 ms

Iteration   3: 19.625 ops/s
                 ·gc.alloc.rate:            51.050 MB/sec
                 ·gc.alloc.rate.norm:       2863854.741 B/op
                 ·gc.churn.Eden_Space:      51.817 MB/sec
                 ·gc.churn.Eden_Space.norm: 2906870.904 B/op
                 ·gc.count:                 2.000 counts
                 ·gc.time:                  15.000 ms

Iteration   4: 19.487 ops/s
                 ·gc.alloc.rate:                50.687 MB/sec
                 ·gc.alloc.rate.norm:           2864035.200 B/op
                 ·gc.churn.Eden_Space:          51.972 MB/sec
                 ·gc.churn.Eden_Space.norm:     2936684.964 B/op
                 ·gc.churn.Survivor_Space:      0.479 MB/sec
                 ·gc.churn.Survivor_Space.norm: 27054.318 B/op
                 ·gc.count:                     2.000 counts
                 ·gc.time:                      12.000 ms

Iteration   5: 19.436 ops/s
                 ·gc.alloc.rate:                50.555 MB/sec
                 ·gc.alloc.rate.norm:           2863858.297 B/op
                 ·gc.churn.Eden_Space:          51.841 MB/sec
                 ·gc.churn.Eden_Space.norm:     2936684.964 B/op
                 ·gc.churn.Survivor_Space:      ≈ 10⁻⁴ MB/sec
                 ·gc.churn.Survivor_Space.norm: 8.410 B/op
                 ·gc.count:                     2.000 counts
                 ·gc.time:                      1.000 ms


# Run progress: 66.67% complete, ETA 00:01:47
# Fork: 3 of 3
# Warmup Iteration   1: 17.377 ops/s
# Warmup Iteration   2: 19.740 ops/s
# Warmup Iteration   3: 19.108 ops/s
# Warmup Iteration   4: 19.570 ops/s
# Warmup Iteration   5: 19.556 ops/s
Iteration   1: 19.765 ops/s
                 ·gc.alloc.rate:            51.411 MB/sec
                 ·gc.alloc.rate.norm:       2863854.020 B/op
                 ·gc.churn.Eden_Space:      51.920 MB/sec
                 ·gc.churn.Eden_Space.norm: 2892189.737 B/op
                 ·gc.count:                 2.000 counts
                 ·gc.time:                  15.000 ms

Iteration   2: 19.761 ops/s
                 ·gc.alloc.rate:            51.399 MB/sec
                 ·gc.alloc.rate.norm:       2863853.374 B/op
                 ·gc.churn.Eden_Space:      51.908 MB/sec
                 ·gc.churn.Eden_Space.norm: 2892189.737 B/op
                 ·gc.count:                 2.000 counts
                 ·gc.time:                  15.000 ms

Iteration   3: 19.726 ops/s
                 ·gc.alloc.rate:            51.313 MB/sec
                 ·gc.alloc.rate.norm:       2863853.374 B/op
                 ·gc.churn.Eden_Space:      51.820 MB/sec
                 ·gc.churn.Eden_Space.norm: 2892189.737 B/op
                 ·gc.count:                 2.000 counts
                 ·gc.time:                  15.000 ms

Iteration   4: 19.831 ops/s
                 ·gc.alloc.rate:                51.590 MB/sec
                 ·gc.alloc.rate.norm:           2864026.131 B/op
                 ·gc.churn.Eden_Space:          51.836 MB/sec
                 ·gc.churn.Eden_Space.norm:     2877656.121 B/op
                 ·gc.churn.Survivor_Space:      0.478 MB/sec
                 ·gc.churn.Survivor_Space.norm: 26512.804 B/op
                 ·gc.count:                     2.000 counts
                 ·gc.time:                      12.000 ms

Iteration   5: 19.885 ops/s
                 ·gc.alloc.rate:                51.709 MB/sec
                 ·gc.alloc.rate.norm:           2863852.784 B/op
                 ·gc.churn.Eden_Space:          51.958 MB/sec
                 ·gc.churn.Eden_Space.norm:     2877656.121 B/op
                 ·gc.churn.Survivor_Space:      ≈ 10⁻⁴ MB/sec
                 ·gc.churn.Survivor_Space.norm: 8.241 B/op
                 ·gc.count:                     2.000 counts
                 ·gc.time:                      1.000 ms



Result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets":
  19.383 ±(99.9%) 0.482 ops/s [Average]
  (min, avg, max) = (18.127, 19.383, 19.885), stdev = 0.451
  CI (99.9%): [18.901, 19.865] (assumes normal distribution)

Secondary result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets:·gc.alloc.rate":
  50.419 ±(99.9%) 1.252 MB/sec [Average]
  (min, avg, max) = (47.152, 50.419, 51.709), stdev = 1.171
  CI (99.9%): [49.167, 51.671] (assumes normal distribution)

Secondary result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets:·gc.alloc.rate.norm":
  2863894.270 ±(99.9%) 77.950 B/op [Average]
  (min, avg, max) = (2863852.784, 2863894.270, 2864042.250), stdev = 72.914
  CI (99.9%): [2863816.321, 2863972.220] (assumes normal distribution)

Secondary result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets:·gc.churn.Eden_Space":
  51.859 ±(99.9%) 0.077 MB/sec [Average]
  (min, avg, max) = (51.747, 51.859, 51.972), stdev = 0.072
  CI (99.9%): [51.782, 51.937] (assumes normal distribution)

Secondary result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets:·gc.churn.Eden_Space.norm":
  2947207.581 ±(99.9%) 73787.357 B/op [Average]
  (min, avg, max) = (2877656.121, 2947207.581, 3146448.176), stdev = 69020.739
  CI (99.9%): [2873420.224, 3020994.937] (assumes normal distribution)

Secondary result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets:·gc.churn.Survivor_Space":
  0.096 ±(99.9%) 0.211 MB/sec [Average]
  (min, avg, max) = (≈ 0, 0.096, 0.479), stdev = 0.198
  CI (99.9%): [≈ 0, 0.307] (assumes normal distribution)

Secondary result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets:·gc.churn.Survivor_Space.norm":
  5405.012 ±(99.9%) 11959.173 B/op [Average]
  (min, avg, max) = (≈ 0, 5405.012, 27482.958), stdev = 11186.617
  CI (99.9%): [≈ 0, 17364.185] (assumes normal distribution)

Secondary result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets:·gc.count":
  30.000 ±(99.9%) 0.001 counts [Sum]
  (min, avg, max) = (2.000, 2.000, 2.000), stdev = 0.001
  CI (99.9%): [30.000, 30.000] (assumes normal distribution)

Secondary result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets:·gc.time":
  175.000 ±(99.9%) 0.001 ms [Sum]
  (min, avg, max) = (1.000, 11.667, 16.000), stdev = 5.678
  CI (99.9%): [175.000, 175.000] (assumes normal distribution)


# Run complete. Total time: 00:05:21

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

NOTE: Current JVM experimentally supports Compiler Blackholes, and they are in use. Please exercise
extra caution when trusting the results, look into the generated code to check the benchmark still
works, and factor in a small probability of new VM bugs. Additionally, while comparisons between
different JVMs are already problematic, the performance difference caused by different Blackhole
modes can be very significant. Please make sure you use the consistent Blackhole mode for comparisons.

Benchmark                                                       Mode  Cnt        Score       Error   Units
GraphemeMatcherBenchmark.tweets                                thrpt   15       19.383 ±     0.482   ops/s
GraphemeMatcherBenchmark.tweets:·gc.alloc.rate                 thrpt   15       50.419 ±     1.252  MB/sec
GraphemeMatcherBenchmark.tweets:·gc.alloc.rate.norm            thrpt   15  2863894.270 ±    77.950    B/op
GraphemeMatcherBenchmark.tweets:·gc.churn.Eden_Space           thrpt   15       51.859 ±     0.077  MB/sec
GraphemeMatcherBenchmark.tweets:·gc.churn.Eden_Space.norm      thrpt   15  2947207.581 ± 73787.357    B/op
GraphemeMatcherBenchmark.tweets:·gc.churn.Survivor_Space       thrpt   15        0.096 ±     0.211  MB/sec
GraphemeMatcherBenchmark.tweets:·gc.churn.Survivor_Space.norm  thrpt   15     5405.012 ± 11959.173    B/op
GraphemeMatcherBenchmark.tweets:·gc.count                      thrpt   15       30.000              counts
GraphemeMatcherBenchmark.tweets:·gc.time                       thrpt   15      175.000                  ms

Understanding Benchmark Outputs

It is important to remember that benchmark outputs are just data. It is up to the analyst to ensure that the benchmark actually measures what it is supposed to measure, to determine the extent to which the data are reliable, and then to interpret the data into performance measurements.

Evaluating Benchmark Quality

There are many ways to evaluate benchmark quality. This section explores some obvious takeaways from this particular outcome. The first performance metric (wall clock time) corresponds to the primary metric reported by JMH, and the second performance metric (memory allocation rate) corresponds to the gc.alloc.rate measurement. The most relevant parts of the benchmark (the summaries for those two measurements) are reproduced here:

Result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets":
  19.383 ±(99.9%) 0.482 ops/s [Average]
  (min, avg, max) = (18.127, 19.383, 19.885), stdev = 0.451
  CI (99.9%): [18.901, 19.865] (assumes normal distribution)

Secondary result "com.sigpwned.foso.benchmark.GraphemeMatcherBenchmark.tweets:·gc.alloc.rate":
  50.419 ±(99.9%) 1.252 MB/sec [Average]
  (min, avg, max) = (47.152, 50.419, 51.709), stdev = 1.171
  CI (99.9%): [49.167, 51.671] (assumes normal distribution)

For both metrics, note that the min and max are close together, and that the standard deviation is less than 5% of the mean (0.451 / 19.383 ≈ 2.3% for throughput, and 1.171 / 50.419 ≈ 2.3% for allocation rate). This indicates that the measurements are consistent, which is a desirable quality in benchmarks.

It’s believable that a small cloud server could process text at roughly 20 MB/s, although there is certainly some room for improvement. So the first metric has face validity.

If the program processes text at 20 MB/s, then it’s also believable that it might allocate memory at a rate of 50 MB/s to do so: that only requires the processing of each byte to result in about 2.5 bytes of allocation on average (50 ÷ 20 = 2.5). So the second metric has face validity, too.

Interpreting Benchmark Results

Now that the benchmark results have been determined to be reasonable, it’s time to interpret them in the context of the performance metrics. From the results:

Benchmark                                                       Mode  Cnt        Score       Error   Units
GraphemeMatcherBenchmark.tweets                                thrpt   15       19.383 ±     0.482   ops/s
GraphemeMatcherBenchmark.tweets:·gc.alloc.rate                 thrpt   15       50.419 ±     1.252  MB/sec

For the first metric, given that the benchmark processes exactly 1MB of text per op and runs at a rate of 19.383 ops/s, the measurement is 19.383 MB/s.

For the second metric, the benchmark reports the measurement directly: 50.419 MB/s.

So the baseline performance measurements are:

  1. Text throughput: 19.383 MB/s
  2. Memory allocation throughput: 50.419 MB/s

Success!

Next Steps

Now that the benchmark is ready and baseline performance measurements are available, we will start attempting to optimize the code and improve the first performance metric in Part II – Optimizing CPU Usage. See you there!