UPMARC Workshop on Task-Based Parallel Programming
Energy efficiency of decoupled access/execute
Konstantinos Koukos, UPMARC.
Abstract. This work demonstrates the performance and power efficiency of decoupled access-execute models on state-of-the-art CPUs. The decoupled access-execute model separates, at the runtime level, loading data into the cache (access) from the computation performed on that data (execute). This enables multicore and multithreaded processors to prefetch data intelligently, improving performance and power efficiency, and is a natural fit for task-parallel applications. This work investigates the potential of the decoupled access-execute model to improve energy efficiency by adjusting processor voltage and frequency separately for the access and execute phases (dynamic voltage and frequency scaling, DVFS). Fundamentally, this approach allows us to reduce processor power consumption (lower frequency and voltage) during the memory-bound access phase and increase processor performance (higher frequency and voltage) during the compute-bound execute phase. As a result, we can finely tune energy consumption to the behavior of the application, achieving improved energy efficiency with little complexity. Unlike other static or dynamic DVFS techniques, our approach relies not on prediction but on feedback from previous task executions, which minimizes total overhead and eliminates prediction error. A major contribution of this work is to define the theoretical limits on the phase granularity at which optimal DVFS can be applied. We evaluate the effectiveness of this approach by running at the highest and lowest frequency for the execute and access phases, respectively, and compare this to a simple first-order model, based on the IPC of the access phase, that chooses higher frequencies for access phases with more complex address-generation patterns. The results are evaluated on multicore systems with external power-measurement hardware across six parallel benchmarks ranging from compute-bound to memory-bound. We demonstrate an average 10% reduction in energy-delay product (EDP) on both Intel and AMD platforms without any performance degradation.
Finally, we extend our study to evaluate the efficiency of modern hardware prefetchers and to show how our model can improve on the state of the art.
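
The intuition behind per-phase frequency selection can be illustrated with a toy calculation. The sketch below is not the model used in this work; it assumes a textbook dynamic-power relation (P = C * V^2 * f), ignores static power and DVFS transition overhead, and treats memory stall time as independent of core frequency. All names (Phase, phase_time, dynamic_power, edp) and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One phase of a task (hypothetical toy description)."""
    cycles: float      # core cycles; this part scales with core frequency
    mem_time_s: float  # memory stall time; assumed independent of core frequency


def phase_time(phase: Phase, freq_hz: float) -> float:
    """Run time of a phase at a given core frequency."""
    return phase.cycles / freq_hz + phase.mem_time_s


def dynamic_power(freq_hz: float, volt: float, c_eff: float = 1e-9) -> float:
    """Textbook dynamic power P = C * V^2 * f (static power is ignored)."""
    return c_eff * volt ** 2 * freq_hz


def edp(access: Phase, execute: Phase,
        f_access: float, v_access: float,
        f_exec: float, v_exec: float) -> float:
    """Energy-delay product of one access phase followed by one execute phase."""
    t_a = phase_time(access, f_access)
    t_e = phase_time(execute, f_exec)
    energy = (dynamic_power(f_access, v_access) * t_a
              + dynamic_power(f_exec, v_exec) * t_e)
    return energy * (t_a + t_e)


# Assumed operating points (illustrative values, not measurements).
F_MAX, V_MAX = 3.0e9, 1.2
F_MIN, V_MIN = 1.5e9, 0.9

# A memory-bound access phase and a compute-bound execute phase.
access = Phase(cycles=1e6, mem_time_s=1e-3)
execute = Phase(cycles=3e6, mem_time_s=0.0)

# Both phases at the highest frequency.
edp_baseline = edp(access, execute, F_MAX, V_MAX, F_MAX, V_MAX)
# Access at the lowest frequency, execute at the highest.
edp_decoupled = edp(access, execute, F_MIN, V_MIN, F_MAX, V_MAX)
```

Under these assumed numbers the access phase spends most of its time waiting on memory, so lowering its frequency lengthens the task only slightly while cutting power substantially, and edp_decoupled comes out below edp_baseline. The size of the gap depends entirely on the assumed parameters and says nothing about the measured results reported above.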