Department of Information Technology
Uppsala Architecture Research Team

Extending Statistical Cache Models to Support Detailed Pipeline Simulators

Simulators are widely used in computer architecture research. While detailed cycle-accurate simulations provide useful insights, studies using modern workloads typically require days or weeks of simulation time. Evaluating many design points only exacerbates this overhead. Recent work proposes accurate methods that reduce the simulation overhead either by sampling the execution (e.g., SMARTS and SimPoint) or by using fast analytical models of the simulated designs (e.g., Interval Simulation).

While these techniques significantly reduce the simulation overhead, processor components with large state, such as the last-level cache, still require costly simulation to warm them up. Statistical sampling methods, such as SMARTS, report that warm-up accounts for 99% of the simulation overhead, while only 1% of the time is spent simulating the target design.
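To put these fractions in perspective, the following is a minimal sketch of the usual Amdahl-style bound, assuming that warm-up could be removed entirely and that the remaining simulation time stays unchanged. The function name `max_speedup` is hypothetical and used only for illustration.

```python
def max_speedup(warmup_fraction: float) -> float:
    """Upper bound on speedup if warm-up cost dropped to zero.

    Assumes the non-warm-up part of the simulation is unaffected.
    """
    if not 0.0 <= warmup_fraction < 1.0:
        raise ValueError("warmup_fraction must be in [0, 1)")
    return 1.0 / (1.0 - warmup_fraction)


# With 99% of the time spent in warm-up, the bound is about 100x.
print(round(max_speedup(0.99)))
```

Under these assumptions, a method that removes warm-up can approach this bound but not exceed it.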

This paper proposes WarmSim, a method that eliminates the need to warm up the cache. WarmSim builds on a statistical cache modeling technique and extends it to accurately model not only the miss ratio but also the outcome of every cache request. WarmSim takes as input an application's memory reuse information, which is hardware independent. Therefore, different cache configurations can be simulated using the same input data. We demonstrate that this approach can be used to estimate the CPI of the SPEC CPU2006 benchmarks with an average error of 1.77%, reducing the overhead by a factor of 50 compared to a simulation with a 10M-instruction warm-up.

75% of the simulation time is spent in functional warming

Speedup Stacks
At each simulation point, 5M instructions are required to warm up a simulator that uses caches of up to 2 MB. This accounts for 75% of the simulation overhead.


WarmSim is a statistical cache model that replaces the traditional cache of a simulator. For each memory request sent by the processor, it decides whether the request hits or misses in the simulated caches. Its input is hardware-independent profiling information, so the same profile can be used to simulate a cache of any size.
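The idea of reusing one hardware-independent profile for many cache sizes can be illustrated with a small sketch. This is not WarmSim's statistical model. It uses exact LRU stack distances and assumes a fully associative LRU cache, and the function names (`stack_distances`, `outcomes`, `miss_ratio`) are hypothetical. The profile is computed once from a trace of cache-line addresses and then queried for different cache sizes.

```python
from collections import OrderedDict


def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for line in trace:
        if line in stack:
            # Count the distinct lines touched since the last access to `line`.
            keys = list(stack)
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)  # cold miss: never seen before
            stack[line] = True
    return distances


def outcomes(distances, cache_lines):
    """Per-request hit (True) or miss (False) for a cache of `cache_lines` lines."""
    return [d is not None and d < cache_lines for d in distances]


def miss_ratio(distances, cache_lines):
    """Fraction of requests that miss in a cache of `cache_lines` lines."""
    if not distances:
        raise ValueError("empty profile")
    hits = outcomes(distances, cache_lines)
    return 1.0 - sum(hits) / len(hits)


# The profile is collected once and reused for every cache size.
profile = stack_distances(list("ABCABDA"))
for size in (2, 3):
    print(size, outcomes(profile, size), miss_ratio(profile, size))
```

The list scan makes this sketch quadratic in the worst case, which is acceptable for illustration but not for real traces. A statistical model such as the one described here would instead estimate these outcomes from sampled reuse information.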


Updated  2014-05-27 12:10:32 by Nikos Nikoleris.