Open Kernel Labs Blog

March 27, 2008

High-performance IPC: ARM9 vs ARM11

This issue came up recently. A customer asked questions along the following lines (I'm not making this up, but suspect it's been planted by competitors): You've got this excellent performance on ARM9 cores which is a result of some cool things you did with the architecture. But you can't do those on ARM11 cores, so your performance won't be as good there?

First off, let me say that we have a ten-year track record of achieving the highest performance on any architecture we focussed our efforts on, and we have no intention of letting up. Our ARM11 kernel will be as fast as it gets, no ifs and buts (although not in the first version). Just like on ARM9. OKL4 is the fastest kernel on the planet, and will remain so.

What needs to be understood, then, is that our cool FASS (fast address-space switching) tricks work around the disadvantages of the ARM9's somewhat idiosyncratic MMU. We did in software what on the ARM11 was done by changing the hardware. It's a bit weird that some of our competitors are trying to turn this achievement into an argument against us. These are the same guys who, 8 years after we showed how to do it (and 4 years after we released the code that does it in Linux), still haven't managed to get decent performance on ARM9. Why would anyone buy from them?

Anyway, this leads to an interesting gedanken experiment: Given an optimally-performing ARM9 kernel, what can we expect the optimal performance of an ARM11 kernel to be?

When looking for answers to such a question we obviously need to understand the architecture, and what the performance impact of its relevant features is. But we also need to be careful to isolate architecture issues from platform issues. Obviously, when you run actual benchmarks, you aren't doing this on an isolated processor core. You're doing it on a computing platform that at the least contains a memory hierarchy besides the processor. And the performance impact of memory can be significant, and in many cases drowns out the impact of the architecture.

The best approach is to eliminate platform effects by looking at benchmarks that are only affected by architectural properties, not memory. And the best way of doing that is to ensure that everything being benchmarked is contained in the L1 cache, so that we avoid memory accesses altogether.

The operation most critically important for a microkernel (including its use as a hypervisor) is the message-passing (IPC) operation. Fortunately, this one can be run completely from the cache: there is no reason for it to touch memory (on the ARM9 this is only true if you use FASS, of course). And, of course, the way to do this is by running a tight loop of inter-address-space IPC operations, the famous ping-pong benchmark. On the ARM926, OKL4 does a one-way IPC in 151 cycles.
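For readers who haven't seen a ping-pong benchmark, here is a minimal sketch of how the one-way figure falls out of the loop measurement. The function name and the input numbers are made up for illustration (they are not measurements); the only thing taken from the text is that each round trip consists of two one-way IPCs.

```python
def one_way_ipc_cycles(total_cycles, round_trips, loop_overhead_cycles=0):
    """Average cost of a one-way IPC derived from a ping-pong loop.

    Each round trip is a ping plus a pong, i.e. two one-way IPCs.
    loop_overhead_cycles is whatever the loop itself costs (hypothetical
    parameter; measure it with an empty loop if it matters).
    """
    return (total_cycles - loop_overhead_cycles) / (2 * round_trips)


# Illustrative inputs only, chosen to be consistent with the 151-cycle figure.
print(one_way_ipc_cycles(total_cycles=302_000, round_trips=1_000))
```

Running many round trips also keeps the caches warm, which is the whole point of the exercise: after the first few iterations nothing should miss in the L1.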

So, what are the relevant properties of the architecture that will affect this cost when we go to the ARM11? The only really relevant feature is that the ARM11 has a deeper pipeline: 8 stages in the ARM1136 vs 5 stages in the ARM926. A deeper pipeline means more pipeline stalls. Specifically:

  1. The cost of trapping into the kernel (through the syscall exception) is essentially a pipeline flush. This means we lose one cycle per pipeline stage. Hence we can expect the syscall trap to be 3 cycles dearer on the ARM11.
  2. Similarly, returning to user mode will normally flush the pipeline. Add another 3 cycles for ARM11.
  3. The cost of branch mispredicts is essentially the pipeline depth. Again we can assume that each branch mispredict is about 3 cycles dearer on the ARM11.
  4. In addition, we're likely to see a somewhat increased number of pipeline bubbles resulting from data or control dependencies.

Points 1 and 2 above should add a constant 6 cycles to the IPC cost. The others are a bit harder to estimate. However, we can assume that the IPC fast path contains only forward branches, and is written so that they are correctly predicted during fast-path operation (this is one of the reasons you have an assembler fast path). The one exception is the first instruction executed in the kernel, the branch from the exception vector to the actual syscall code, which is almost certainly mispredicted. So we can expect point 3 to add another 3 cycles.
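The arithmetic behind those 9 cycles can be written down as a small sketch. It assumes, as the list above does, that each flush costs one cycle per pipeline stage; the names are mine, the stage counts are the ones quoted earlier.

```python
# Pipeline depths as quoted in the text.
ARM926_STAGES = 5
ARM1136_STAGES = 8

# The three events on the IPC fast path that are assumed to flush the pipeline.
FLUSH_EVENTS = (
    "syscall trap",
    "mispredicted branch from the exception vector",
    "return to user mode",
)


def extra_flush_cycles(shallow_stages, deep_stages, events):
    """Extra cycles on the deeper pipeline, at one cycle per extra stage per flush."""
    return (deep_stages - shallow_stages) * len(events)


extra = extra_flush_cycles(ARM926_STAGES, ARM1136_STAGES, FLUSH_EVENTS)
print(extra)  # 9
```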

This leaves bubbles from pipeline hazards. These are essentially caused by a result being produced in a register by one instruction and consumed by a subsequent instruction. If these instructions are close to each other, then the result is not ready when it is needed, and the processor stalls. With a deeper pipeline the instructions must be further apart, so the same code will incur more stalls on the longer pipeline.

From looking at the architecture, it seems that the safe distance for data dependencies is 2 cycles on the ARM926 and 4 cycles on the ARM1136. This implies that a hazard may cause a 1-cycle stall on an ARM926 and up to 3 stall cycles on an ARM1136. However, the ARM1136 makes extensive use of forwarding, which makes outputs available to earlier pipeline stages before they are sent out to registers. Most instructions should therefore not produce any stalls on the ARM11, even if the result is used in the immediately following instruction. The ARM926 probably doesn't use forwarding (precise information is hard to come by), so it seems that the ARM1136 will actually avoid many of the bubbles that occur on the ARM926.
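A toy model makes the stall numbers concrete. This is a deliberate simplification, not a description of either core's real interlock logic: it assumes single issue, measures distance in instructions (1 means the consumer is the very next instruction), and models forwarding as shrinking the safe distance to 1.

```python
# Safe producer-to-consumer distances as estimated in the text.
ARM926_SAFE_DISTANCE = 2
ARM1136_SAFE_DISTANCE = 4
# Assumption for this sketch: with forwarding, a result is usable by the next instruction.
FORWARDED_SAFE_DISTANCE = 1


def stall_cycles(safe_distance, distance):
    """Stall cycles when a result is consumed `distance` instructions after it is produced."""
    return max(0, safe_distance - distance)


# Back-to-back dependent instructions (distance 1):
print(stall_cycles(ARM926_SAFE_DISTANCE, 1))     # 1
print(stall_cycles(ARM1136_SAFE_DISTANCE, 1))    # 3
print(stall_cycles(FORWARDED_SAFE_DISTANCE, 1))  # 0
```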

The main exceptions are load and store instructions, which generally have 3-cycle latencies on the ARM1136. However, most loads and stores in the fast path avoid bubbles through appropriate instruction scheduling. At a quick glance it seems that there will be fewer than 20 load/store bubbles on the 1136. This observation was made on the code of OKL4 Release 1.4.1.1, which is a good starting point, as the ARM9 code was fully optimised while work on ARM11 had not yet commenced.

Let's look at this code a bit more closely. From the syscall trap to the return to user mode, I counted 101 instructions, which executed in 151 cycles. Removing the syscall trap, the mispredicted branch from the exception vector, and the userland return (three instructions costing 5 cycles each on the ARM926), we have 98 instructions taking 151-3*5=136 cycles. Most of the 38 cycles beyond one per instruction will be data hazards in arithmetic instructions, plus a few co-processor instructions. On average, an instruction executes in 136/98=1.39 cycles (a CPI of 1.39). Despite the deeper pipeline, the CPI on the ARM11 shouldn't be all that different to that of the ARM9: increased load-to-use latencies being balanced by reduced data dependencies (due to forwarding).
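For anyone who wants to check the numbers, the calculation can be spelled out as follows. The variable names are mine; the counts are the ones given above, and the 5-cycle cost per flushing instruction is the ARM926 pipeline depth.

```python
total_instructions = 101   # syscall trap to return to user mode
total_cycles = 151         # measured one-way IPC on the ARM926

flush_instructions = 3     # syscall trap, exception-vector branch, return to user
arm926_flush_cost = 5      # cycles per flush (pipeline depth of the ARM926)

body_instructions = total_instructions - flush_instructions
body_cycles = total_cycles - flush_instructions * arm926_flush_cost
stall_cycles_in_body = body_cycles - body_instructions
cpi = body_cycles / body_instructions

print(body_instructions, body_cycles, stall_cycles_in_body)  # 98 136 38
print(round(cpi, 2))  # 1.39
```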

On top of that, the ARM11 doesn't need FASS, and disabling FASS reduces the length of the ARM9 fast path by 18 instructions, a further saving (there will be a few extra instructions to maintain the ASID register, but fewer than FASS needs).

So, in summary, we have extra cycles (9) from the syscall, initial branch and return to user, increased stalls on loads, stores and coprocessor instructions, reduced stalls on register dependencies, and reduced instruction count due to removing FASS.

So, the expectation is that the cycle count for a fast IPC system call on the ARM1136 will be in the same ballpark as on the ARM926.
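To see why "same ballpark" is plausible, here is a back-of-the-envelope sketch that plugs the numbers from this post into a very crude model. It is an illustration, not a prediction: it assumes the ARM11 CPI equals the ARM926 CPI (the post only argues they shouldn't differ much), it charges 8 cycles for each of the three flushes, and the number of ASID-maintenance instructions is a free, hypothetical parameter (the post only says it is fewer than the 18 FASS instructions).

```python
ARM926_BODY_INSTRUCTIONS = 98   # fast path minus the three flushing instructions
ARM926_BODY_CYCLES = 136
FASS_INSTRUCTIONS = 18          # removed when FASS is disabled
FLUSH_EVENTS = 3                # syscall trap, vector branch, return to user
ARM1136_FLUSH_COST = 8          # assumed: one cycle per pipeline stage


def estimate_arm1136_cycles(asid_instructions):
    """Crude estimate of ARM1136 fast-path cycles; asid_instructions is hypothetical."""
    cpi = ARM926_BODY_CYCLES / ARM926_BODY_INSTRUCTIONS  # assumed unchanged on ARM11
    body = ARM926_BODY_INSTRUCTIONS - FASS_INSTRUCTIONS + asid_instructions
    return body * cpi + FLUSH_EVENTS * ARM1136_FLUSH_COST


# Sweep the unknown parameter over its possible range (0 up to "fewer than FASS").
for asid_instructions in (0, 5, 10, 17):
    print(asid_instructions, round(estimate_arm1136_cycles(asid_instructions)))
```

Whatever value you pick for the unknown, the result stays within a few percent of the 151 cycles measured on the ARM926, which is all the argument needs.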

Posted by Gernot Heiser on March 27 at 01:07 AM

About Gernot Heiser:

Gernot Heiser, Chief Technology Officer, never thought he would be in the business world. Prior to NICTA's creation in 2003, Dr Heiser was a full-time faculty member at the University of New South Wales. However, this die-hard academic couldn’t pass up the opportunity to see the commercialization of this research. Gernot still loves teaching, almost as much as he loves good wine and good food. And anyone will tell you that Gernot knows his wine.
