capnp-ocaml blog

capnp-ocaml 2.0: The Road to Unembarrassing Performance

In July I published the first release of capnp-ocaml, an OCaml code generator plugin for the Cap’n Proto serialization framework. This release went out the door in a largely functional state, but without much attention paid to the performance characteristics of generated code. Feeling uncomfortable with this state of affairs, I started work on an implementation of the "suite of silly benchmarks" that ships with the Cap'n Proto C++ compiler. My first run of the "carsales" benchmark looked something like this:

Early carsales benchmark

Oh dear.

I'm under no illusion that OCaml code will be performance-competitive with carefully-written C++; the language trades away a lot of control over allocation and memory layout for comparable gains in safety and expressiveness. But I'm used to seeing a performance degradation in the neighborhood of 3x to 6x. Time to dig out the profiler and get to work.

 7.49%  caml_apply2
 4.64%  camlPres_impl__unsafe_get_ar_1040
 4.40%  camlRes__unsafe_get_1031
 3.23%  caml_page_table_lookup
 3.23%  caml_int64_compare
 2.74%  camlPres_impl__get_1055
 2.60%  camlCapnpRuntime__Message__get_uint8_2095
 2.52%  camlCapnpCarsales__random_car_2865
 2.33%  mark_slice
 1.89%  caml_apply3
 1.75%  camlCapnpRuntime__Message__get_uint16_2099
 1.72%  camlCapnpRuntime__Reader__get_bit_3495
 1.64%  camlFastRand__next_1011
 1.58%  camlCapnpCarsales__car_value_2779
 1.37%  camlCapnpRuntime__Message__set_uint8_2127
 1.34%  camlCapnpRuntime__Reader__get_uint16_3533
 1.24%  camlCapnpRuntime__Common__bounds_check_slice_exn_1807
 1.22%  camlCore_kernel__Int_conversions__is_in_range_1027
 1.12%  camlCapnpRuntime__Common__deref_pointer_2629
 1.12%  camlCapnpRuntime__Builder__get_data_field_7221
 1.10%  camlCapnpRuntime__Common__make_storage_2720
 1.10%  camlCapnpRuntime__Common__decode_pointer_1879
 1.05%  sweep_slice
 1.04%  camlCapnpRuntime__Builder__set_bit_7290
 ...

You know how great it is when your profiler trace resembles a textbook example of the Pareto principle? This doesn't look like that. This is the trace of a program that leaves a little performance on the table in a lot of places. Le sigh.

So I started chipping away. And here's what I did to the benchmark timings over the course of about 40 commits:

Carsales benchmark over time

Benchmarking is Hard

Some of the poor performance had very little to do with Cap'n Proto serialization. The benchmark test cases perform some computations on the data being serialized, both on the client side and on the server side. Some of these computations happen to run a little slow in OCaml.

The benchmark uses an Xorshift RNG, which is known to be a high-performance technique that yields "pretty good" random bits–a natural choice for a benchmark that relies on random numbers only for the generation of some interesting test data.

static inline uint32_t nextFastRand() {
  static uint32_t x = 0x1d2acd47;
  static uint32_t y = 0x58ca3e14;
  static uint32_t z = 0xf563f232;
  static uint32_t w = 0x0bc76199;

  uint32_t tmp = x ^ (x << 11);
  x = y;
  y = z;
  z = w;
  w = w ^ (w >> 19) ^ tmp ^ (tmp >> 8);
  return w;
}

Early on, I could see that my OCaml implementation of this RNG algorithm was not performing very well. Looking at the generated assembly, we can see that access to the persistent RNG state is kind of expensive (a different OCaml data structure might improve the situation, but ideally the RNG state should be completely decoupled from the OCaml heap). Furthermore, the assembly for the Xorshift operations is bloated as an unfortunate result of OCaml's tagged integer representation–operations which are intended to compile down to single instructions get transformed into more complicated sequences that preserve the integer tag bits. GCC generates something much more compact and straightforward for the C implementation.

The decision here was easy. The point of the benchmark is not to test RNG performance; we care a lot more about testing the performance of the capnp-ocaml serializers. The RNG code was trivially replaced with a C extension that performs well. Similarly, OCaml-based implementations of strstr() were not competitive for the short-string data set found in the "catrank" benchmark, and calling straight into glibc was an easy way to make the benchmark results more informative.

Optimization Opportunities

Ignoring these uninteresting cases, what were the most effective optimizations?

Current Performance

So where do things sit now? Here is the current performance for the "carsales" benchmark: Current carsales benchmark This probably requires some explanation:

The "carsales" benchmark exercises many primitive accessors, which is clearly one of the weakest aspects of the OCaml implementation. capnp-ocaml takes about five times as long as capnp-c++ across the board, which is a large improvement over the initial release.

The "catrank" benchmark is heavy on accesses to string fields. For ease of use in OCaml code, code for reading a string from a message will always allocate a new OCaml string and copy data into it–there are no generated accessors that perform a zero-copy access. At the first release, this copy was slow due to the lack of a memcpy()-like primitive. This optimization has since been added, and consequently capnp-ocaml has somewhat stronger performance on "catrank": Current catrank benchmark This plot also makes clear the relatively poor performance of the OCaml "packing" implementation relative to the C++ version. This is unlikely to improve without the use of a C extension, which I would prefer to avoid.

The "eval" benchmark constructs relatively short tree-structured messages which make use of Cap'n Proto's schema-level support for sum types (which capnp-ocaml conveniently represents as OCaml variants). Once again, capnp-ocaml does OK relative to capnp-c++: Current eval benchmark

Conclusion

Idiomatic OCaml code tends to run pretty fast on average, but it can be challenging to squeeze all the cycles out of performance-sensitive code. The GC is a harsh mistress, freeing the programmer from memory management concerns while offering only limited ways to work around allocation-related performance issues.

capnp-ocaml is no longer embarrassingly slow, it is now merely "a bit slower than I would like." Some less formal tests suggested that the current performance level is close to that of bin_prot, although the message sizes are somewhat larger in the absence of compression.