Open Kernel Labs Blog

December 31, 2008

Small Code

A project that I’m working on at the moment calls for a very small footprint. This post is about how to make code really small for the ARM architecture.

As you can probably guess, I’m interested in operating system code, so as an example, I’ve taken a very simple piece of operating system code, and stripped it right back to demonstrate some of the techniques to use when optimising for space. So here is the snippet of code we are going to optimise:

01 struct thread {
02     unsigned int notify_bits;
03 };
04
05 unsigned int
06 poll(struct thread *thread, unsigned int mask)
07 {
08     unsigned int result;
09     result = thread->notify_bits & mask;
10     if (result) {
11         thread->notify_bits &= ~result;
12     }
13     return result;
14 }

In this very simple operating system we have threads (thread data is stored in struct thread). Each thread has a set of 32 signals (encoded in a single word notify_bits). This poll is used by a thread to determine if it has been sent any signals. The mask parameter is the set of signals that the thread is interested in checking. So, a thread can check if a single signal has been thread, or if any signal has been set, or if a specific subset of signals has been set. The function returns the signals that are available (which is simply the bit-wise and of notify_bits and mask). It is important that the function clears any signals that have been returned. This makes sure that if poll is called twice the same signals are not returned. This is achieved in lines 10—12.

So, our goal is to try and get this code as small as possible. And every byte counts! First off, we just try and compile this with the standard ARM gcc compiler. I’m using version 4.2.2. So we start with: $ arm-elf-gcc -c poll.c -o poll.o. We can then use object dump to work out what the compiler did: $ arm-elf-objdump -dl poll.o.

00000000 <poll&ht;:
poll():
   0:    e1a0c00d     mov    ip, sp
   4:    e92dd800     push    {fp, ip, lr, pc}
   8:    e24cb004     sub    fp, ip, #4    ; 0x4
   c:    e24dd00c     sub    sp, sp, #12    ; 0xc
  10:    e50b0014     str    r0, [fp, #-20]
  14:    e50b1018     str    r1, [fp, #-24]
  18:    e51b3014     ldr    r3, [fp, #-20]
  1c:    e5932000     ldr    r2, [r3]
  20:    e51b3018     ldr    r3, [fp, #-24]
  24:    e0023003     and    r3, r2, r3
  28:    e50b3010     str    r3, [fp, #-16]
  2c:    e51b3010     ldr    r3, [fp, #-16]
  30:    e3530000     cmp    r3, #0    ; 0x0
  34:    0a000006     beq    54 <poll+0x54>
  38:    e51b3014     ldr    r3, [fp, #-20]
  3c:    e5932000     ldr    r2, [r3]
  40:    e51b3010     ldr    r3, [fp, #-16]
  44:    e1e03003     mvn    r3, r3
  48:    e0022003     and    r2, r2, r3
  4c:    e51b3014     ldr    r3, [fp, #-20]
  50:    e5832000     str    r2, [r3]
  54:    e51b3010     ldr    r3, [fp, #-16]
  58:    e1a00003     mov    r0, r3
  5c:    e24bd00c     sub    sp, fp, #12    ; 0xc
  60:    e89da800     ldm    sp, {fp, sp, pc}

So, our first go gets us 100 bytes of code. Which is 10 bytes (or 2.5 instructions) for each of our lines of code. We should be able to do better. Well, the first thing is we should try is to use optimisation: $ arm-elf-gcc -c poll.c -o poll.o -O2. This gives us a much better code generation output:

00000000 <poll>:
poll():
   0:    e5902000     ldr    r2, [r0]
   4:    e1a0c000     mov    ip, r0
   8:    e0110002     ands    r0, r1, r2
   c:    11e03000     mvnne    r3, r0
  10:    10033002     andne    r3, r3, r2
  14:    158c3000     strne    r3, [ip]
  18:    e12fff1e     bx    lr

So this got us down to 28 bytes. A factor 4 improvement for one compiler flag, not bad. Now, -O2 does some standard omptimisations, but -Os, will do optimisations specifically for reducing the amount of code. So trying: $ arm-elf-gcc -c poll.c -o poll.o -Os. This gives a little bit better code-gen:

00000000 <poll>:
poll():
   0:    e5903000     ldr    r3, [r0]
   4:    e1a02000     mov    r2, r0
   8:    e0110003     ands    r0, r1, r3
   c:    11c33000     bicne    r3, r3, r0
  10:    15823000     strne    r3, [r2]
  14:    e12fff1e     bx    lr

Down to 24 bytes (6 instructions), is pretty good. Now, as you can see the generated code has 32-bits per instruction. The some of the ARM architectures have two distinct instruction sets, ARM and Thumb. The Thumb instruction set uses 16-bit per instruction, instead of 32-bit. This denser instruction set can enable much smaller code sizes. Of course there is a trade-off here. The functionality of the 16-bit instructions is going to be less than the 32-bit instructions. But lets give it a try. At the same time, we will tell the compiler the exact CPU we want to compile for (which is the ARM7TDMI-S) in our case. The compiler line is: $ arm-elf-gcc -c poll.c -o poll.o -Os -mcpu=arm7tdmi -mthumb. Which produces code like:

00000000 <poll>:
poll():
   0:    6803          ldr    r3, [r0, #0]
   2:    1c02          adds    r2, r0, #0
   4:    1c08          adds    r0, r1, #0
   6:    4018          ands    r0, r3
   8:    d001          beq.n    e <poll+0xe>
   a:    4383          bics    r3, r0
   c:    6013          str    r3, [r2, #0]
   e:    4770          bx    lr

So, now we are down to 16 bytes, so in Thumb we need 8 instructions (2 more than ARM), but each is only 2 bytes, not 4, so we end up with a 1/3 improvement. To get any further, we need to start looking at our code again, and see if there are ways of improving the code. Looking at the code again:

00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03     unsigned int result;
04     result = thread->notify_bits & mask;
05     if (result) {
06         thread->notify_bits &= ~result;
07     }
08     return result;
09 }

You may notice that the branch instruction on line 5, you may notice that this is actually redundant. If result is zero, then ~result well be 0xffffffff. Given this thread->notify_bits &= 0xffffffff will not change the value of thread->notify_bits. So, we can reduce this to:

00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03     unsigned int result;
04     result = thread->notify_bits & mask;
05     thread->notify_bits &= ~result;
06     return result;
07 }

When we compile this we get down to:

00000000 <poll>:
poll():
   0:    6803          ldr    r3, [r0, #0]
   2:    4019          ands    r1, r3
   4:    438b          bics    r3, r1
   6:    6003          str    r3, [r0, #0]
   8:    1c08          adds    r0, r1, #0
   a:    4770          bx    lr

This gets us down to 6 instructions (12 bytes). Pretty good since we started at 100 bytes. Now lets look at the object code in a little bit more detail. If you look at address 0x8, the instruction simply moves register r1 into register r0 so that it in the right place for return. (Note: The ARM ABI has the return value stored in r0). This seems like a bit of a waste, it would be good if there was a way we could have the value stored in r0 and not waste an instruction just moving values between registers. Now, to get a better understanding of what the instructions are doing, I’m going to slightly rewrite the code, and then compile with debugging, so we can see how the generated code matches up with the source code. So first, lets rewrite the code a little bit:

00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03     unsigned int tmp_notify_bits = thread->notify_bits;
04     mask &= tmp_notify_bits;
05     tmp_notify_bits &= ~mask;
06     thread->notify_bits = tmp_notify_bits;
07     return mask;
08 }

You should convince yourself that this code is equivalent to the existing code. (The code generated is identical). So, we can now line up the source code with the generated code. On line 03 (unsigned int tmp_notify_bits = thread->notify_bits), this matches up with address 0x0 (ldr r3, [r0, #0]). So, register r3 is used to store variable tmp_notify_bits. The parameter thread is stored in register r0. Now, line 04 (mask &= tmp_notify_bits) matches directly with address 0x2 (ands r1, r3). Register r1 matches directly with the mask parameter. The important part of restructuring the code as we have done, is that it becomes obvious that we can directly using the mask parameter instead of needing an extra variable like in the previous code. As we continue line 05 (tmp_notify_bits &= ~mask), matches directly to 0x4 (bics r3, r1). The bics instruction is quite neat in that it can essentially do the &= ~ in one 16-bit instruction. Line 06 (thread->notify_bits = tmp_notify_bits) stores the result back to memory, matches directly to 0x6. Now the final line of code (return mask;), needs two instructions 0x8 and 0xa (adds r0, r1, #0 and bx lr). Now the reason we need to instructions is because mask is stored in register r1, and the return value needs to be in register r0. So how can we get mask to be stored in r0 all along? Well, if we switch the parameters around poll(unsigned int mask, struct thread *thread), the mask will instead be stored in r0 instead of r1. (Note: We don’t necessarily have to change the interface of the function. If we want to support keep the interface we can use a simple macro to textually swap the parameters.) If we compile this, we get the following generated code:

00000000 <poll>:
poll():
   0:    680b          ldr    r3, [r1, #0]
   2:    4018          ands    r0, r3
   4:    4383          bics    r3, r0
   6:    600b          str    r3, [r1, #0]
   8:    4770          bx    lr

So, we have got this down to 5 instructions, 10 bytes. This is a factor 10 improvement, not bad!

So a bit of a review:

    * Compile to Thumb instructions.
    * Use -Os to optimise for space.
    * Look for unnecessary code.
    * Look for examples of registers used in the wrong place.

Posted by Benno Leslie on December 31 at 01:43 PM

blog comments powered by Disqus
Benno Leslie's avatar

About Benno Leslie:

Benno Leslie, Vice President of Engineering at OK Labs holds a dual degree in Computer Engineering (with first-class honors) and Arts from UNSW. While at work, Benno does his best to avoid the marketing department, while he oversees a team of lead engineers and the customer support organization. When summer hits, the rugby field comes calling and Benno is either tackling others, out cycling, or letting loose at concerts.

Email Benno Leslie

Virtualization for Embedded SystemesPermalink

▲ Back to Top