If you have a smartphone and its battery is quickly being drained, you may have discovered that by quitting apps or removing them from your phone, the battery lasts longer. Sensors, transmitters and receivers, and computers all take energy, and that is something mobile device designers are concerned about. It turns out that app developers make important decisions that affect battery life.
Of course, I consider aircraft and spacecraft to be very mobile devices. In fact, in the last week, I’ve heard from parts of the aerospace community on this subject. More specifically, they are concerned about how to minimize the energy consumption of computation. Their interest ranges from mobile CPUs and sensors to high performance computing.
Aside from the work that hardware designers do, what can/should software developers and users be aware of? I’m going to try to lay a foundation for the subject in this post. It may initially seem that energy consumption is outside a software person’s control. I will give some examples where it actually makes a difference. Other examples are taken from hardware, but illustrate the implications of system design choices, e.g., using a simple embedded system vs. a multi-tasking system.
Before doing so, I need to confess something. Between the computing hardware and software worlds, each side tends to believe that I belong to the other. Neither side enthusiastically claims me as their own. Really, I learned computing because I couldn’t get my hands on a wind tunnel. My youth was spent with T-square, right triangles, and French curves rather than oscilloscopes, resistors, and diodes. But professionally, I am a computer scientist. I happen to have worked with a lot of digital and computing systems designers, including a few years at the chip level.
All gates are created equal… to a first approximation…
For our purposes, we will think in terms of gates — fundamental blocks that have a 0 or 1 as their output. This is a major simplification. Chip designers frequently think in terms of NAND or NOR gates, switches, or transistors (going from most to least abstract). In fact, a NAND gate may be a 2-, 3-, or 4-input NAND gate. There are also nuances in the rise and fall times of propagated signals. I’m also going to assume a silicon technology like CMOS rather than nMOS, pMOS, or bipolar types. In practice, CMOS dominates the vast majority of digital devices. But for our purposes, all gates are created equal. (We’ll ignore whether or not some seem to be more equal than others…)
Fundamentally, changing the state of a gate takes energy. For a first-order approximation, it doesn’t matter if it moves from 0 to 1 or 1 to 0; the result is pretty much the same. The amount of energy needed for a computation is directly related to the number of gate state changes needed to complete the computation.
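As a back-of-the-envelope illustration of why state changes cost energy, the classic first-order model for CMOS switching power is P ≈ α·C·V²·f, where α is the fraction of gates switching per cycle. The sketch below uses made-up figures; the function name and every parameter value are illustrative assumptions, not data for any real chip.

```python
# First-order CMOS dynamic power model: P = alpha * C * V^2 * f,
# where alpha is the activity factor (fraction of gates switching per cycle).
# All figures below are illustrative assumptions, not real chip data.

def dynamic_power(alpha, c_farads, v_volts, f_hz):
    """Switching power in watts under the first-order CMOS model."""
    return alpha * c_farads * v_volts ** 2 * f_hz

# Hypothetical mobile SoC: 1 nF effective switched capacitance,
# 1.0 V supply, 1 GHz clock, 10% of gates switching each cycle.
p = dynamic_power(alpha=0.1, c_farads=1e-9, v_volts=1.0, f_hz=1e9)
print(f"{p:.2f} W")  # 0.10 W
```

Note how the model rewards exactly the things this post is about: fewer state changes (smaller α) and lower clock speeds both cut power directly.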
Program counters, page boundaries
One of the strange side effects is on the program counter. Assume you have a short loop that runs from 0x0010 to 0x0013 and back again. This takes less energy than a loop that runs from 0x0ffe to 0x1001. Why? Two major sets of gate state changes happen:
- Going from 0x0fff to 0x1000, there are 1 change of 0->1 and 12 changes of 1->0 — a total of 13 state changes.
- When a pass through the loop finishes, it jumps from 0x1001 to 0x0ffe, which has 11 changes of 0->1 and 2 changes of 1->0 — again 13 state changes.
In the other case, running between 0x0010 and 0x0013, the state changes are confined to 2 bits.
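The bit counting above can be checked mechanically. This is a small Python illustration of the same arithmetic (not anything you would run on the device itself): the number of toggled program-counter bits is just the Hamming distance between consecutive addresses.

```python
# Count how many program-counter bits toggle between two addresses.
# To a first approximation, each toggled bit is one gate state change.

def bit_flips(a, b):
    """Hamming distance: number of bits that differ between a and b."""
    return bin(a ^ b).count("1")

# Crossing the page boundary, 0x0fff -> 0x1000, flips 13 bits.
print(bit_flips(0x0FFF, 0x1000))  # 13

# The jump back, 0x1001 -> 0x0ffe, also flips 13 bits.
print(bit_flips(0x1001, 0x0FFE))  # 13

# The tight loop at 0x0010..0x0013 never flips more than 2 bits,
# including the jump back from 0x0013 to 0x0010.
steps = [bit_flips(a, a + 1) for a in range(0x0010, 0x0013)]
steps.append(bit_flips(0x0013, 0x0010))
print(max(steps))  # 2
```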
Moral of the story: frequent tight loops should avoid page boundaries.
The program counter is simply one part of the story. There is the execution of instructions, the impact on registers and memory, etc. But if those are the same between the two address sets, the code placement emerges as a variable that can be manipulated.
Order of an algorithm

In understanding the order of an algorithm, a sorting program that processes n inputs in n steps (or even 2n steps) is considered to be linear. Its run-time is of order n, written O(n). As the number of inputs increases, asymptotic trends emerge. And fewer steps mean fewer gate state changes, and hence less energy consumed.

A sorting program whose run-time increases as O(n log n) ultimately performs better than a program that runs in O(n²). (Computer scientists and software engineers understand this as the difference between the quicksort and bubblesort algorithms.) If tricks can be played to let the sorting algorithm approach O(n), that algorithm has the potential to perform even better.
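To make the asymptotic difference concrete, here is a sketch that counts comparisons — a rough proxy for work done, and hence gate state changes — for an O(n²) bubble sort and an O(n log n) merge sort. (Merge sort stands in for quicksort here to keep the count deterministic; all the names are my own for illustration.)

```python
# Compare the comparison counts of an O(n^2) sort and an O(n log n) sort.
import random

def bubble_sort_comparisons(data):
    """Bubble sort a copy of data; return the number of comparisons."""
    a, count = list(data), 0
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            count += 1
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return count

def merge_sort_comparisons(data):
    """Merge sort; return (sorted list, number of comparisons)."""
    if len(data) <= 1:
        return list(data), 0
    mid = len(data) // 2
    left, cl = merge_sort_comparisons(data[:mid])
    right, cr = merge_sort_comparisons(data[mid:])
    merged, count, i, j = [], cl + cr, 0, 0
    while i < len(left) and j < len(right):
        count += 1
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged += left[i:] + right[j:]
    return merged, count

random.seed(0)
data = [random.randrange(10_000) for _ in range(1_000)]
print(bubble_sort_comparisons(data))    # 499500, i.e. n*(n-1)/2
print(merge_sort_comparisons(data)[1])  # roughly n log2 n, i.e. ~10,000
```

For 1,000 inputs the gap is already a factor of about fifty; at a million inputs it is a factor of tens of thousands, which is why algorithm choice dwarfs most other software-level energy decisions.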
Varying processor performance
In the smartphone world, we’re beginning to see devices that have multiple high-performance cores and one low-power low-performance core on the same piece of silicon. When performance demand drops off and only maintenance tasks are running, the low-power core continues running while the high-performance cores are sleeping.
This technique is exploited by processor designer ARM Holdings in its big.LITTLE™ processing design. (Yes, the word “big” is lower case, and “LITTLE” is all caps.) ARM claims this can reduce energy consumption by 70% or more for light workloads and by 50% for moderate workloads. In the case of big.LITTLE, the ARM Cortex-A15 is paired with the lower-power ARM Cortex-A7. [strange typos fixed 9/18]
Of course, the selection of whether to run just the low-power core as opposed to the high-performance cores is made by the operating system.
In some simpler systems, the user is able to select between higher performance and lower power, thus extending battery life. In this case, the system clock speed is set high or low accordingly. The user enters a choice via the device user interface, which is then interpreted by the operating system as a performance choice.
Choosing among radios

Smartphones typically have at least three types of radio — Bluetooth, wi-fi, and a wireless telecom standard (3G, 4G, perhaps even more). Communications for an app are a trade-off between necessary data rate and power consumption. When possible, the app developer should choose the communication mode that requires the least amount of power for the job. Often this means preferring wi-fi over the wireless telecom. (In fact, Bluetooth takes far less power than the others, but is not used as a multi-hop network protocol.)
DMA, buffers, interrupts
An operating system faces varying challenges in servicing I/O requests, particularly if it operates under real-time constraints. In modern computer architectures, block data can be transferred under DMA (direct memory access) control without interrupting the CPU. But once the DMA operation is finished, an interrupt has to be issued so that the data can be dealt with. This, of course, requires a context switch from a process to the kernel, and possibly to another process.
Some operating systems are more efficient about switching between processes than others. Historically, UNIX has been better at this than Windows. However, when you introduce threads (a lighter-weight scheduling mechanism) into the picture, there is no clear advantage. Thus, Windows programs are often designed with many threads. In general, basic UNIX programs do not use multiple threads, but can easily be assembled as building blocks. On the other hand, UNIX database programs typically run as monolithic components, but make extensive use of threads. Java programs invariably run many threads.
Some I/O interfaces cannot run DMA and require more frequent OS attention. Sometimes there is a buffer for 3 or 4 characters. Before that overflows, the OS needs to copy the buffer content. This can result in very high context switching overhead when certain apps are running.
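A quick back-of-the-envelope calculation shows how punishing a small buffer can be. The figures below are assumptions chosen for illustration (a 115,200-baud serial link, 10 bits on the wire per character, a 4-character buffer), not measurements of any particular device.

```python
# Back-of-the-envelope: how often must the OS service a tiny I/O buffer?
# Assumed figures: 115,200-baud serial link, 10 bits per character
# (start + 8 data + stop), and a 4-character hardware buffer.

BAUD = 115_200
BITS_PER_CHAR = 10
BUFFER_CHARS = 4

chars_per_second = BAUD / BITS_PER_CHAR             # 11,520 chars/s
interrupts_per_second = chars_per_second / BUFFER_CHARS
print(f"{interrupts_per_second:.0f} interrupts/s")  # 2880 interrupts/s

# With DMA moving a 1 KB block at a time, the interrupt rate collapses:
dma_interrupts = chars_per_second / 1024
print(f"{dma_interrupts:.2f} interrupts/s")         # 11.25 interrupts/s
```

Nearly three thousand context switches per second for one modest serial stream, versus about a dozen with DMA — each switch burning gate state changes that do no useful computation.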
The newest iPhone, the 5S, supplements the main A7 chip with an M7 chip to directly deal with data from the accelerometers, gyroscope, and compass. Details of the M7 chip have not yet been published. I suspect this vastly reduces the CPU load when motion-based apps are running. It certainly cuts out a lot of interrupts. What is not clear to me is whether or not the M7 also does low-power matrix computations. If a sequence of matrix operations is only done 50 or 100 times per second, a high-performance multiplier-accumulator may not be necessary, and a low-power version can be put in the co-processor, further relieving the CPU of certain real-time burdens.
Loop vs. ‘halt’
When there is no more computation to do, some operating systems put themselves into a tight idle loop, waiting for the next interrupt to come in. Others execute a ‘halt’ instruction and wait. If you measure the CPU temperature, the latter runs significantly cooler. The former would not be practical for a mobile device. Naturally, I consider aircraft and spacecraft to be mobile. So I dislike operating systems with tight idle loops.
Memory management schemes
A key computing architecture feature that makes smartphones possible is demand paging, a memory management scheme invented in the late 1960s. It makes multi-tasking and adaptability to new programs a fundamental reality. But the logic design behind a memory management unit (MMU) requires a LOT of gates, and thus consumes a lot of energy. So, for simple dedicated real-time systems, it may be best to avoid the need for a paging MMU.
Processors such as the ARM Cortex-M series use a segmentation scheme that loads registers with a segment base address and a segment length. The complexity and power costs of a demand-paging MMU are not there.
The PDP-11 used a hybrid between paging and segmentation. PDP-11 programs were limited to 64K bytes, but the processor had 8 segment registers to map 8K segments to the desired parts of memory. The processor could then switch application programs by switching the contents of the segment registers. As a result, many models of the PDP-11 were able to handle several users on a time-sharing system, giving rise to the UNIX operating system.
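The scheme can be sketched in a few lines. This is a simplified illustration of segment-register translation, not an exact model of PDP-11 hardware; the register contents and function names are invented for the example.

```python
# Simplified PDP-11-style address mapping: eight segment registers,
# each mapping one 8 KB slice of a 64 KB virtual space into physical
# memory. Register values below are illustrative, not real PDP-11 data.

SEG_SIZE = 8 * 1024   # 8 KB per segment
NUM_SEGS = 8          # 8 segments -> 64 KB virtual address space

def translate(vaddr, seg_base):
    """Map a 16-bit virtual address through a set of segment registers."""
    seg = vaddr // SEG_SIZE    # top 3 bits select the segment register
    offset = vaddr % SEG_SIZE  # low 13 bits are the offset within it
    return seg_base[seg] + offset

# One process's mapping: each segment placed at some physical base.
process_a = [0x20000, 0x08000, 0x0A000, 0x30000,
             0x12000, 0x14000, 0x16000, 0x3E000]

print(hex(translate(0x0004, process_a)))  # 0x20004 (segment 0)
print(hex(translate(0x2010, process_a)))  # 0x8010  (segment 1)

# A context switch is just loading another set of segment registers:
process_b = [0x40000, 0x42000, 0x44000, 0x46000,
             0x48000, 0x4A000, 0x4C000, 0x4E000]
print(hex(translate(0x0004, process_b)))  # 0x40004
```

Eight register loads per context switch is a far cry from walking and flushing page tables, which is exactly why this style of mapping is so cheap in gates.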
Given the small address space and small set of segment registers, a PDP-11 would take considerably fewer gates than a VAX-11. If both were resurrected today, the PDP-11 would be more power efficient.
It is worth noting that virtualization, the current practice of replacing several physical machines with virtual machines, has had an immense impact on the energy footprint of data centers. However, a context switch between virtual machines is even more heavyweight than switching processes on a single machine.
The virtualization kernel and hypervisor(s) are presumed to be reliable components that separate unreliable operating systems from each other. Thus, the failure of a single process in one operating system doesn’t affect key application processes on another virtual machine, because there are no other application processes there.
In fact, the physical machine running the multiple virtual machines probably consumes more energy than combining all the processes on a single reliable operating system on the same physical machine would. But managers of data centers are constrained by the commercial choices available to them, which favor certain operating systems. Thus, the minimum energy consumption that can be achieved is not as low as it could be with a single operating system.
To be sure, there are good uses for virtualization. When different software packages require different versions of the same operating system, virtualization provides a way to host all the packages on the same hardware. Virtualization has evolved to where an application can be migrated off one physical machine and onto another one, letting the former be brought down for maintenance without interrupting program operation. These are just a couple of examples of legitimate uses.
The list goes on…
There are certainly other CPU design strategies that can affect energy consumption.
I could start in on speculative execution and other CPU accelerators, which gain performance at the expense of additional power. A stark contrast to these is the SPARC chip-multithreading (CMT) architecture. The fundamental concept gets rid of the accelerators and replaces them with a massive set of simpler cores and threads, resulting in much lower power per CPU thread.
At this point, though, we’re clearly no longer in a realm that software or system designers can affect; little or nothing can be done through software.
So that’s my view of computation energy consumption from a software perspective, but also peering into the system hardware.