I spend a lot of time profiling and optimizing Parrot and Rakudo. Even though we have added several
features to Parrot since the 1.0 release in March and the 1.4 release in July,
we've improved speed and memory use by measurable amounts. We haven't yet
reached the point of powerful, orders-of-magnitude optimizations, but we're
preparing for them.
I've written before that I believe the single optimization principle for
writing a compiler and runtime for a dynamic language is to replace as much
genericity and flexibility as possible with straight-line code. One
optimization from yesterday demonstrates this in dramatic fashion.
The oofib.pir
benchmark exercises method calls and argument passing. It also creates
garbage-collectable objects. It's a little heavy on math operations for a
general-purpose benchmark, but I like that it performs frequent invocations.
Optimizing those helps almost every real program, as does improving the garbage
collector.
I normally perform profiling work with Callgrind; I've not found a better
tool for counting the raw instructions a program executes. Yesterday I tried
Cachegrind instead. Both tools work with the KCachegrind visualizer, which
makes their output much easier to explore. Where Callgrind measures the
instructions actually executed, Cachegrind measures the processor cache
behavior of the program.
As modern processors are so much faster than memory (even fast processor caches), sometimes it's worth trading a few extra cycles for fewer cache misses. Similarly, as modern processors have deep pipelines and perform speculative execution, rewriting code to avoid branch mispredictions can also improve performance.
Branch prediction is a processor feature which analyzes a branch condition
-- an if
condition, for example -- and predicts where execution
will go. Speculative execution performs the operations of that expected branch
even before the processor knows whether its prediction is correct. If it
guessed correctly, there's no penalty and a big improvement over having to
wait. If it guessed incorrectly, it has to throw away that work. It's a risk,
but the processor usually guesses correctly.
Sometimes it needs help.
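To make that concrete, here's a generic C illustration -- not Parrot code -- of a loop whose branch depends entirely on the data flowing through it:

    #include <stddef.h>

    /* Generic illustration, not Parrot code. Whether this branch is cheap
     * depends on how predictable its outcome is. */
    long sum_small_values(const int *data, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            /* Over random data this branch goes either way about half the
             * time, so the predictor is frequently wrong and the pipeline
             * stalls. Over sorted data the same branch is taken in one long
             * run and then never again, and prediction is nearly free. */
            if (data[i] < 128)
                sum += data[i];
        }
        return sum;
    }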
Parrot's default garbage collector is an optimized-but-clunky stop-the-world mark and sweep collector. We wanted to replace it with a concurrent collector with better throughput before Parrot 1.0, but refactoring the internals so that we could change the GC without breaking everyone else's code required more time and expertise than we could invest before the 1.0 release. (We're in much better shape now.)
Our garbage collector tracks two similar-but-different types of objects:
STRINGs and PMCs. Think of a STRING as a chunk of binary data with an encoding
and a length and a buffer. Think of a PMC as an object. A STRING is a single,
self-contained unit. A PMC may contain attributes which refer to other PMCs.
There are finer distinctions between them, but that's all you need to
understand now.
To simplify our GC, both STRING and PMC have a shared header called PObj.
This structure contains the garbage collector flags: what type of PObj is it
(STRING or PMC), is it live or free, does it have custom mark behavior, and
does it have custom destruction behavior. Note that both STRINGs and PMCs share the
first two flags, while only PMCs have the latter two.
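In rough C terms -- the names and layouts here are illustrative sketches, not Parrot's actual definitions -- the arrangement looks something like this:

    #include <stddef.h>

    /* Illustrative sketch; flag names and struct layouts are assumptions,
     * not Parrot's actual definitions. */
    typedef enum {
        POBJ_IS_PMC         = 1 << 0, /* type flag: PMC if set, STRING if not */
        POBJ_IS_LIVE        = 1 << 1, /* set during the mark phase            */
        POBJ_CUSTOM_MARK    = 1 << 2, /* PMC only: custom mark behavior       */
        POBJ_CUSTOM_DESTROY = 1 << 3  /* PMC only: custom destruction         */
    } pobj_flag;

    typedef struct PObj {
        unsigned int flags;           /* shared GC header */
    } PObj;

    typedef struct STRING {
        PObj    header;               /* a STRING is self-contained: */
        char   *buffer;               /* a buffer, a length, and an  */
        size_t  length;               /* encoding                    */
        int     encoding;
    } STRING;

    typedef struct PMC {
        PObj          header;
        struct PMC  **attributes;     /* may refer to other PMCs */
        size_t        n_attrs;
    } PMC;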
A stop-the-world mark and sweep collector starts from a root set of objects
that the runtime knows are still alive. It traverses that root set,
marking everything in it as alive and recursing into any PObj reachable from
it, then into any PObj reachable from those, and so on,
until it has marked the entire graph of reachable objects as alive. Then it
sweeps through the pools containing all allocated objects, freeing anything not
marked as live and clearing all flags.
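A minimal sketch of that cycle, built on the illustrative types above (again, not Parrot's real API), might look like this; mark_pobj() appears in the next sketch, and free_pobj() stands in for returning an object to its pool:

    /* Minimal sketch of a stop-the-world collection cycle. */
    extern void mark_pobj(PObj *obj);
    extern void free_pobj(PObj *obj);   /* hypothetical helper */

    void mark_and_sweep(PObj **roots, size_t n_roots,
                        PObj **pool,  size_t pool_size)
    {
        /* Mark: everything reachable from the root set stays alive. */
        for (size_t i = 0; i < n_roots; i++)
            mark_pobj(roots[i]);

        /* Sweep: free anything unmarked, clear the flag on survivors. */
        for (size_t i = 0; i < pool_size; i++) {
            if (pool[i]->flags & POBJ_IS_LIVE)
                pool[i]->flags &= ~POBJ_IS_LIVE;
            else
                free_pobj(pool[i]);
        }
    }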
There was a single "mark this PObj as live" function. Whether we wanted to mark a STRING or a PMC, we called the same function, casting the object into a PObj.
The astute can pause here to say "Yes, of course, you're throwing away information there!"
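In sketch form, continuing the illustrative types above (this is not the actual Parrot function), the old approach behaved roughly like this:

    /* Roughly how the single mark function behaved (illustrative).
     * Every caller casts its STRING or PMC to a PObj, so the type the
     * C compiler knew statically must be rediscovered at runtime. */
    void mark_pobj(PObj *obj)
    {
        if (obj->flags & POBJ_IS_LIVE)
            return;                               /* already marked */
        obj->flags |= POBJ_IS_LIVE;

        /* This runtime type check is the branch the predictor hates. */
        if (obj->flags & POBJ_IS_PMC) {
            PMC *pmc = (PMC *)obj;
            /* A PMC may have custom mark behavior and may refer to
             * other PMCs, so marking it can recurse. */
            for (size_t i = 0; i < pmc->n_attrs; i++)
                if (pmc->attributes[i])
                    mark_pobj((PObj *)pmc->attributes[i]);
        }
        /* A STRING needs nothing more than the flag. */
    }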
Parrot r41447
added separate functions to mark PMCs and STRINGs as live, so the C
compiler can tell us if we pass the wrong kind of object to either one. (We've
had a couple of bugs from expecting the wrong thing.)
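Conceptually, the change gives each type its own entry point; these prototypes are hypothetical, not Parrot's real names:

    /* Hypothetical prototypes in the spirit of that change. With distinct
     * parameter types, passing a STRING where a PMC belongs is now a
     * compile-time diagnostic rather than a silent runtime mistake. */
    void mark_pmc_alive(PMC *pmc);
    void mark_string_alive(STRING *str);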
My Cachegrind profile showed a huge number of branch prediction misses in
the PObj marking function. Specifically, the processor could never predict
whether it was marking a STRING or a PMC. As you might expect, marking a
STRING as live means only setting a flag, while marking a PMC requires checking
if it has custom mark behavior and potentially recursing. There's no way to
predict which path through the live object graph the mark phase will take, and
there's no way to predict whether the branch predictor will see a run of PMC,
PMC, PMC and get on a good train of prediction or whether it'll flip back and
forth between PMC, STRING, PMC and continually be wrong. (The mark phase
is deterministic for any code without random allocations, but there's
too much data to predict any pattern.)
As we now had separate functions to mark each, I pulled the guts of the
single mark function into the two helper functions... and reduced the number of
branch mispredictions by over 70% in that benchmark.
For a further optimization, I realized that there's no need even to call the marking function for STRINGs, at least not from code in the core of Parrot, which can flip the flag directly.
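The shape of the result, continuing the sketch above (names still illustrative), looks something like this:

    /* Each helper is now straight-line code for its own type,
     * with no runtime type branch. */
    void mark_string_alive(STRING *str)
    {
        str->header.flags |= POBJ_IS_LIVE;   /* nothing to recurse into */
    }

    void mark_pmc_alive(PMC *pmc)
    {
        if (pmc->header.flags & POBJ_IS_LIVE)
            return;
        pmc->header.flags |= POBJ_IS_LIVE;
        /* Only PMCs can have custom mark behavior or refer to other PMCs. */
        for (size_t i = 0; i < pmc->n_attrs; i++)
            if (pmc->attributes[i])
                mark_pmc_alive(pmc->attributes[i]);
    }

    /* Core code that already holds a STRING can skip the call entirely
     * and flip the flag in place, for example via a macro like this: */
    #define MARK_STRING_ALIVE(s) ((s)->header.flags |= POBJ_IS_LIVE)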
The end result is a little more code -- not much, maybe a dozen lines -- but a huge increase in clarity, an improvement in simplicity, and better optimization and performance characteristics. The compiler now warns us where it matters most (correctness), and we get better performance at a level normal profiling can't even see.
I'm even tempted to call this pattern "Replace Conditional with
API."