From JVMLangSummit
Jump to navigationJump to search

On Wednesday afternoon (9/16), we held an impromptu workshop (rant?) about the increasing semantic gap between programs and their execution, and the consequent degradation in the performance model. The performance model should not to be confused with the performance. While the absolute performance has increased, it has become much more difficult to predict the performance of our programs due to the increasing complexity of processors, VMs, and languages. It may seem counterintuitive, but the quality of the performance and the performance model are inversely related. To increase performance, we complicate our system, resulting in a worse performance model.

No one knew of a solution to the problem, but Brian Goetz opined that it was only a problem for the most sophisticated programmers. That said, we put together a little list of performance gotcahs: constructs whose performance was far worse than it appears. Here is the list in its raw form. Hopefully we can illustrate these things with actual code:

(1) Manual outlining to allow for subsequent inlining.

(2) && and || are branches. Unpredictable branches are very expensive (~100 instructions).

(3) Watch out for field accesses, especially volatile fields. They aren't free, and some are expensive. Some field accesses cause method invocations (e.g., accessing private fields in enclosing classes from nested classes).

(4) Polymorphic method invocations are much more expensive than monomorphic.

(5) Division is expensive (except for some division by constants in some VMs). Both / and % are division operators (i.e., both are expensive). Some VMs, included the HotSpot server compiler, know how to optimize division by certain constants.

(6) Bounds checking elimination in loops is a big win. To enable it, ensure that iteration is by a constant, and when iterating over arrays, that no aliasing can occur.

In the presence of the huge semantic gap, performance measurement becomes even more important than it was. It was noted several profiling tools; they will give you different answers. Some systematic biases due to the fact that they run at "safepoints."

A few clarifications

Remember that this discussion took place at a JVM workshop. I was not saying that the semantic gap is a terrible problem for typical programmers, but it is a very real problem for library designers, language designers, VM implementers, and the like.

I would never suggest that you should avoid all of the potentially expensive operations enumerated above! They can make your code clearer, and first and foremost, you should write short, clear, maintainable, correct programs. Then optimize as necessary (if at all). That was one of my main messages in Effective Java (see, e.g., Item 55 "Optimize Judiciously"). My opinion hasn't changed.

But when it does come time to optimize, the process is greatly complicated by the semantic gap. Consider this: Suppose you carefully write a well-designed microbenchmark that does everything right (e.g., warms up the VM, ensures that computation is not optimized away, times a sufficient amount of computation, does multiple timings to ensure repeatability). You run the benchmark, and see that after warmup, every run takes nearly the same amount of time. Happiness? You run the benchmark again, and again every run takes the same amount of time, but it's a different amount! You run the program twenty times, and see the results clustering into several groups, but always consistent within a program-run. What is going on?

In modern VMs such as HotSpot, the task of deciding what to inline and when to inline it is performed by a background thread (at runtime). This process is called "compilation planning." Because it's multithreaded, it's nondeterministic. Suppose you have code that looks something like this:

    static void foo(int m, int n) {
        for (int i = 0; i < m; i++)

    static void bar(int n)  {
        for (int i = 0; i < n; i++)

The compilation planning thread uses "inlining heuristics" to decide what to inline. It has to make decisions based on local information, but they have global effects. In the simplified example above, inlining the call to bar may preclude inlining the call to baz, and vice-versa. Of course you're far better off only inlining the call to baz (in the inner loop), but the compilation planning thread doesn't have the global knowledge to make the right decision, so you may end up doing the right inlining some of the time and the wrong one some of the time. This explains the confusing behavior we observed above: when you benchmark the code, you get consistency from run to run in a single program-run (after the VM is "warmed up"), but restart the VM and you get a different consistent result.

For more information on this topic, see Georges, Buytaert and Eeckhout's OOPSLA 2007 paper, Statistically Rigorous Java Performance Evaluation.

This is one of many such example of the poor performance model that we live with these days. Mytkowicz and Diwan presented a paper at ASPLOS '09 discussing the problem from a more C/C++/Unix-centric (and philosophical) point-of-view. The paper is entitled Producing Wrong Data Without Doing Anything Obviously Wrong!.

Even if you write only X86 assembly code, the performance model is fiendishly complex. Cliff Click gives a talk on this from a processor-design perspective, entitled This Is Not Your Father's Von Neumann Machine; How Modern Architecture Impacts Your Java Apps.

A part of the problem is due to the underlying memory system. This issue is explored by Ulrich Drepper's 2007 paper, What Every Programmer Should Know About Memory.

Finally, Doug Lea, David Bacon, and David Grove presented a paper at the 2008 SIGPLAN Workshop on Programming Language Curriculum discussing these issues from a teaching standpoint: (Languages and Performance Engineering: Method, Instrumentation, and Pedagogy, citation).

That's five papers in the last couple of years, and there are others. So why is this such a hot topic all of a sudden? Because it's effecting everyone who is trying to do serious performance work these days. It is much more difficult to optimize code than it used to be, and the optimized code is much less likely to remain "optimal" as VMs, processors, and other infrastructure evolve.

In summary, I agree that this topic is not terribly important to typical application programmers in their daily work, and should never be used as an excuse for premature optimization. But it is crucially important to people who build infrastructure and do performance analysis.