Uppsala Architecture Research Team Publications
2024
- Mutator-Driven Object Placement using Load Barriers. In MPLR 2024: Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, Association for Computing Machinery (ACM), 2024. (DOI, Fulltext).
2023
- Doppelganger Loads: A Safe, Complexity-Effective Optimization for Secure Speculation Schemes. In ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture, Conference Proceedings Annual International Symposium on Computer Architecture, Association for Computing Machinery (ACM), New York, NY, 2023. (DOI, fulltext:print).
- Exploring the Latency Sensitivity of Cache Replacement Policies. In IEEE Computer Architecture Letters, volume 22, number 2, pp 93-96, Institute of Electrical and Electronics Engineers (IEEE), 2023. (DOI, fulltext:postprint).
- Faster Functional Warming with Cache Merging. In Proceedings of System Engineering for Constrained Embedded Systems, DroneSE and RAPIDO 2023, pp 39-47, Association for Computing Machinery (ACM), 2023. (DOI).
- Game-of-Life Temperature-Aware DVFS Strategy for Tile-Based Chip Many-Core Processors. In IEEE Journal on Emerging and Selected Topics in Circuits and Systems, volume 13, number 1, pp 58-72, Institute of Electrical and Electronics Engineers (IEEE), 2023. (DOI).
- How addresses are made. In 2023 IEEE International Symposium on Workload Characterization, IISWC, International Symposium on Workload Characterization Proceedings, pp 223-225, IEEE, 2023. (DOI).
- Large-scale Graph Processing on Commodity Systems: Understanding and Mitigating the Impact of Swapping. In The International Symposium on Memory Systems (MEMSYS '23), pp 1-11, Association for Computing Machinery (ACM), 2023. (DOI, Fulltext, fulltext:print).
- Protean: Resource-efficient Instruction Prefetching. In The International Symposium on Memory Systems (MEMSYS '23), pp 1-13, Association for Computing Machinery (ACM), 2023. (DOI, Fulltext, fulltext:print).
- ReCon: Efficient Detection, Management, and Use of Non-Speculative Information Leakage. In 56th IEEE/ACM International Symposium on Microarchitecture, MICRO 2023, pp 828-842, Association for Computing Machinery (ACM), 2023. (DOI, Fulltext, fulltext:print).
- SE-CNN: Convolution Neural Network Acceleration via Symbolic Value Prediction. In IEEE Journal on Emerging and Selected Topics in Circuits and Systems, volume 13, number 1, pp 73-85, Institute of Electrical and Electronics Engineers (IEEE), 2023. (DOI).
- Silent Stores in the Battery-less Internet of Things: A Good Idea? 2023.
- Speculative inter-thread store-to-load forwarding in SMT architectures. In Journal of Parallel and Distributed Computing, volume 173, pp 94-106, Elsevier, 2023. (DOI, Fulltext).
2022
- Analysing software prefetching opportunities in hardware transactional memory. In Journal of Supercomputing, volume 78, number 1, pp 919-944, Springer Nature, 2022. (DOI).
- Clueless: A Tool Characterising Values Leaking as Addresses. In Proceedings of the 11th International Workshop on Hardware and Architectural Support for Security And Privacy, HASP 2022, pp 27-34, Association for Computing Machinery (ACM), 2022. (DOI, Fulltext, fulltext:print).
- Data-Out Instruction-In (DOIN!): Leveraging Inclusive Caches to Attack Speculative Delay Schemes. In 2022 IEEE International Symposium on Secure and Private Execution Environment Design (SEED 2022), pp 49-60, Institute of Electrical and Electronics Engineers (IEEE), 2022. (DOI).
- Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks. In ACM Transactions on Architecture and Code Optimization (TACO), volume 20, number 1, Association for Computing Machinery (ACM), 2022. (DOI, Fulltext, fulltext:print).
- Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores. In ACM Transactions on Architecture and Code Optimization (TACO), volume 19, number 2, Association for Computing Machinery (ACM), 2022. (DOI).
- Every Walk's a Hit: Making Page Walks Single-Access Cache Hits. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’22), February 28 – March 4, 2022, Lausanne, Switzerland, Association for Computing Machinery (ACM), 2022. (DOI, Fulltext, fulltext:postprint, fulltext:print).
- Faster Functional Warming with Cache Merging. 2022. (fulltext).
- Free Atomics: Hardware Atomic Operations without Fences. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA '22), Conference Proceedings Annual International Symposium on Computer Architecture, pp 14-26, Association for Computing Machinery (ACM), 2022. (DOI).
- Splash-4: A Modern Benchmark Suite with Lock-Free Constructs. In 2022 IEEE International Symposium on Workload Characterization (IISWC), Proceedings of the IEEE International Symposium on Workload Characterization, pp 51-64, Institute of Electrical and Electronics Engineers (IEEE), 2022. (DOI).
- Supporting Dynamic Translation Granularity for Hybrid Memory Systems. In 2022 IEEE 40th International Conference on Computer Design (ICCD), Proceedings IEEE International Conference on Computer Design, pp 25-32, Institute of Electrical and Electronics Engineers (IEEE), 2022. (DOI).
2021
- A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006. In ACM Transactions on Architecture and Code Optimization (TACO), volume 18, number 2, Association for Computing Machinery (ACM), 2021. (DOI).
- Do Not Predict – Recompute!: How Value Recomputation Can Truly Boost the Performance of Invisible Speculation. In 2021 International Symposium on Secure and Private Execution Environment Design (SEED), pp 89-100, Institute of Electrical and Electronics Engineers (IEEE), 2021. (DOI).
- Early Address Prediction: Efficient Pipeline Prefetch and Reuse. In ACM Transactions on Architecture and Code Optimization (TACO), volume 18, number 3, Association for Computing Machinery (ACM), 2021. (DOI, Fulltext, fulltext:print).
- Efficient, Distributed, and Non-Speculative Multi-Address Atomic Operations. In Proceedings of 54th Annual IEEE/ACM International Symposium on Microarchitecture, Micro 2021, International Symposium on Microarchitecture Proceedings, pp 337-349, Association for Computing Machinery (ACM), 2021. (DOI).
- ITSLF: Inter-Thread Store-to-Load Forwarding in Simultaneous Multithreading. In Proceedings of 54th Annual IEEE/ACM International Symposium on Microarchitecture, Micro 2021, International Symposium on Microarchitecture Proceedings, pp 1296-1308, Association for Computing Machinery (ACM), 2021. (DOI).
- Reorder Buffer Contention: A Forward Speculative Interference Attack for Speculation Invariant Instructions. In IEEE Computer Architecture Letters, volume 20, number 2, pp 162-165, Institute of Electrical and Electronics Engineers (IEEE), 2021. (DOI).
- Seeds of SEED: Preventing Priority Inversion in Instruction Scheduling to Disrupt Speculative Interference. In 2021 International Symposium on Secure and Private Execution Environment Design (SEED), pp 101-107, Institute of Electrical and Electronics Engineers (IEEE), 2021. (DOI).
- Splash-4: Improving Scalability with Lock-Free Constructs. In 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2021), pp 235-236, Institute of Electrical and Electronics Engineers (IEEE), 2021. (DOI).
- TSOPER: Efficient Coherence-Based Strict Persistency. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), International Symposium on High-Performance Computer Architecture : Proceedings, pp 125-138, Institute of Electrical and Electronics Engineers (IEEE), 2021. (DOI).
2020
- Architecturally-independent and time-based characterization of SPEC CPU 2017. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp 107-109, 2020. (DOI, fulltext:postprint, fulltext:preprint).
- Boosting Store Buffer Efficiency with Store-Prefetch Bursts. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp 568-580, Institute of Electrical and Electronics Engineers (IEEE), 2020. (DOI, Fulltext, fulltext:print).
- Clearing the Shadows: Recovering Lost Performance for Invisible Speculative Execution through HW/SW Co-Design. In PACT ’20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, International Conference on Parallel Architectures and Compilation Techniques, pp 241-254, Association for Computing Machinery (ACM), 2020. (DOI, External link).
- Decoupled Address Translation for Heterogeneous Memory Systems. In PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, International Conference on Parallel Architectures and Compilation Techniques, pp 155-156, Association for Computing Machinery (ACM), 2020. (DOI, Fulltext).
- Delay and Bypass: Ready and Criticality Aware Instruction Scheduling in Out-of-Order Processors. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), International Symposium on High-Performance Computer Architecture-Proceedings, pp 424-434, 2020. (DOI).
- Efficient temporal and spatial load to load forwarding. In Proc. 26th International Symposium on High-Performance Computer Architecture, IEEE Computer Society, 2020.
- Evaluating the Potential Applications of Quaternary Logic for Approximate Computing. In ACM Journal on Emerging Technologies in Computing Systems, volume 16, number 1, Association for Computing Machinery (ACM), 2020. (DOI).
- Modeling and Optimizing NUMA Effects and Prefetching with Machine Learning. In ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing, 2020. (DOI, Fulltext, fulltext:postprint).
- Perforated Page: Supporting Fragmented Memory Allocation for Large Pages. In Proceedings of the 47th Annual ACM/IEEE International Symposium on Computer Architecture (ISCA), pp 913-925, 2020. (DOI, fulltext:postprint).
- RVSDG: An Intermediate Representation for Optimizing Compilers. In ACM Transactions on Embedded Computing Systems, volume 19, number 6, 2020. (DOI).
- Raw-Data: A Reusable Characterization Of The Memory System Behavior Of SPEC 2017 And SPEC 2006. 2020. (data set).
- Reconciling Time Slice Conflicts of Virtual Machines With Dual Time Slice for Clouds. In IEEE Transactions on Parallel and Distributed Systems, volume 31, number 10, pp 2453-2465, 2020. (DOI).
- Speculative Enforcement of Store Atomicity. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp 555-567, Institute of Electrical and Electronics Engineers (IEEE), 2020. (DOI, Fulltext, fulltext:postprint).
- Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services. IEEE, 2020. (DOI).
- Understanding Selective Delay as a Method for Efficient Secure Speculative Execution. In IEEE Transactions on Computers, volume 69, number 11, pp 1584-1595, 2020. (DOI).
2019
- Directed Statistical Warming through Time Traveling. In MICRO'52: The 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp 1037-1049, 2019. (DOI).
- Efficient invisible speculative execution through selective delay and value prediction. In Proc. 46th International Symposium on Computer Architecture, pp 723-735, ACM Press, New York, 2019. (DOI, fulltext:postprint).
- Efficient thread/page/parallelism autotuning for NUMA systems. In ICS '19: Proceedings of the ACM International Conference on Supercomputing, pp 342-353, Association for Computing Machinery (ACM), New York, NY, USA, 2019. (DOI, Fulltext, fulltext:print).
- FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), Design Automation and Test in Europe Conference and Exhibition, pp 716-721, IEEE, 2019. (DOI, fulltext:postprint).
- Filter caching for free: The untapped potential of the store-buffer. In Proc. 46th International Symposium on Computer Architecture, pp 436-448, ACM Press, New York, 2019. (DOI, Fulltext, fulltext:print).
- Freeway: Maximizing MLP for Slice-Out-of-Order Execution. In 2019 25th IEEE International Symposium on High Performance Computer Architecture (HPCA), International Symposium on High-Performance Computer Architecture-Proceedings, pp 558-569, IEEE, 2019. (DOI, fulltext:postprint).
- Ghost Loads: What is the cost of invisible speculation? In Proceedings of the 16th ACM International Conference on Computing Frontiers, pp 153-163, ACM Press, New York, 2019. (DOI, fulltext:postprint).
- Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit. In Journal of Signal Processing Systems, volume 91, number 3-4, pp 379-397, 2019. (DOI, Fulltext, fulltext:print).
- Minimizing Replay under Way-Prediction. Technical report / Department of Information Technology, Uppsala University nr 2019-003, 2019. (fulltext).
- Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing. In ACM Transactions on Reconfigurable Technology and Systems, volume 12, number 3, Association for Computing Machinery (ACM), 2019. (DOI).
2018
- Analyzing performance variation of task schedulers with TaskInsight. In Parallel Computing, volume 75, pp 11-27, 2018. (DOI).
- Automatic Detection of Large Extended Data-Race-Free Regions with Conflict Isolation. In IEEE Transactions on Parallel and Distributed Systems, volume 29, number 3, pp 527-541, IEEE Computer Society, 2018. (DOI).
- Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-based GPUs. In Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2018, pp 1-11, IEEE Computer Society, 2018. (DOI, fulltext:preprint).
- Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation. Technical report / Department of Information Technology, Uppsala University nr 2018-014, 2018. (fulltext).
- Dynamically Disabling Way-prediction to Reduce Instruction Replay. In 2018 IEEE 36th International Conference on Computer Design (ICCD), Proceedings IEEE International Conference on Computer Design, pp 140-143, IEEE, 2018. (DOI, External link).
- Mending fences with self-invalidation and self-downgrade. In Logical Methods in Computer Science, volume 14, number 1, 2018. (DOI, Fulltext).
- Non-Speculative Load Reordering in Total Store Ordering. In IEEE Micro, volume 38, number 3, pp 48-57, IEEE Computer Society, 2018. (DOI).
- Non-Speculative Store Coalescing in Total Store Order. In Proc. 45th International Symposium on Computer Architecture, pp 221-234, IEEE, 2018. (DOI, fulltext:postprint).
- SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp 328-343, Association for Computing Machinery (ACM), 2018. (DOI, fulltext:print).
- Static instruction scheduling for high performance on limited hardware. In IEEE Transactions on Computers, volume 67, number 4, pp 513-527, 2018. (DOI).
- Tail-PASS: Resource-based Cache Management for Tiled Graphics Rendering Hardware. In Proc. 16th International Conference on Parallel and Distributed Processing with Applications, pp 55-63, IEEE, 2018. (DOI).
- The Superfluous Load Queue. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp 95-107, IEEE, 2018. (DOI, fulltext:postprint).
2017
- A Taxonomy of Out-of-Order Instruction Commit. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp 135-136, IEEE Computer Society, Los Alamitos, 2017. (DOI).
- A dedicated private-shared cache design for scalable multiprocessors. In Concurrency and Computation, volume 29, number 2, 2017. (DOI).
- A graphics tracing framework for exploring CPU+GPU memory systems. In Proc. 20th International Symposium on Workload Characterization, pp 54-65, IEEE, 2017. (DOI).
- A split cache hierarchy for enabling data-oriented optimizations. In Proc. 23rd International Symposium on High Performance Computer Architecture, pp 133-144, IEEE Computer Society, 2017. (DOI).
- Adaptive cache warming for faster simulations. In Proc. 9th Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, ACM Press, New York, 2017. (DOI, Fulltext, fulltext:print).
- Addressing energy challenges in filter caches. In Proc. 29th International Symposium on Computer Architecture and High Performance Computing, pp 49-56, IEEE Computer Society, 2017. (DOI).
- Analyzing Graphics Workloads on Tile-based GPUs. In Proc. 20th International Symposium on Workload Characterization, pp 108-109, IEEE, 2017. (DOI).
- Automatic detection of extended data-race-free regions. In Proc. 15th International Symposium on Code Generation and Optimization, pp 14-26, IEEE Press, Piscataway, NJ, 2017. (Paper, fulltext:postprint).
- Clairvoyance: Look-ahead compile-time scheduling. In Proc. 15th International Symposium on Code Generation and Optimization, pp 171-184, IEEE Press, Piscataway, NJ, 2017. (fulltext:postprint).
- Decoupled Access-Execute on ARM big.LITTLE. In Proc. 5th Workshop on High Performance Energy Efficient Embedded Systems, 2017. (External link).
- Efficient Self-Invalidation/Self-Downgrade for Critical Sections with Relaxed Semantics. In IEEE Transactions on Parallel and Distributed Systems, volume 28, number 12, pp 3413-3425, 2017. (DOI).
- Exploring scheduling effects on task performance with TaskInsight. In Supercomputing frontiers and innovations, volume 4, number 3, pp 91-98, 2017. (DOI, Fulltext).
- Exploring the performance limits of out-of-order commit. In Proc. 14th Computing Frontiers Conference, pp 211-220, ACM Press, New York, 2017. (DOI, attachment:print).
- Non-speculative load-load reordering in TSO. In Proc. 44th International Symposium on Computer Architecture, pp 187-200, ACM Press, New York, 2017. (DOI).
- POSTER: Putting the G back into GPU/CPU Systems Research. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), International Conference on Parallel Architectures and Compilation Techniques, pp 130-131, 2017. (DOI).
- Scope-Aware Classification: Taking the hierarchical private/shared data classification to the next level. Technical report / Department of Information Technology, Uppsala University nr 2017-008, 2017. (fulltext).
- TaskInsight: Understanding task schedules effects on memory and performance. In Proc. 8th International Workshop on Programming Models and Applications for Multicores and Manycores, pp 11-20, ACM Press, New York, 2017. (DOI, Fulltext).
- The best of both works: A hybrid data-race-free cache coherence scheme. 2017.
- Transcending hardware limits with software out-of-order processing. In IEEE Computer Architecture Letters, volume 16, number 2, pp 162-165, 2017. (DOI).
- Understanding the interplay between task scheduling, memory and performance. In Proc. Companion 8th ACM International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, pp 21-23, ACM Press, New York, 2017. (DOI).
2016
- A hybrid static–dynamic classification for dual-consistency cache coherence. In IEEE Transactions on Parallel and Distributed Systems, volume 27, number 11, pp 3101-3115, 2016. (DOI).
- A unified DVFS-cache resizing framework. Technical report / Department of Information Technology, Uppsala University nr 2016-014, 2016. (fulltext).
- Analytical Processor Performance and Power Modeling Using Micro-Architecture Independent Characteristics. In IEEE Transactions on Computers, volume 65, number 12, pp 3537-3551, 2016. (DOI).
- Approximation: A New Paradigm also for Wireless Sensing. 2016.
- Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead. In ACM Transactions on Architecture and Code Optimization (TACO), volume 13, number 1, 2016. (DOI, fulltext:preprint).
- Characterizing Task Scheduling Performance Based on Data Reuse. In Proc. 9th Nordic Workshop on Multi-Core Computing, 2016. (fulltext:print).
- CoolSim: Eliminating Traditional Cache Warming with Fast, Virtualized Profiling. In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2016), IEEE International Symposium on Performance Analysis of Systems and Software-ISPASS, pp 149-150, 2016.
- CoolSim: Statistical Techniques to Replace Cache Warming with Efficient, Virtualized Profiling. In Proceedings of 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), pp 106-115, IEEE, 2016.
- Data placement across the cache hierarchy: Minimizing data movement with reuse-aware placement. In Proc. 34th International Conference on Computer Design, Proceedings IEEE International Conference on Computer Design, pp 117-124, IEEE, Piscataway, NJ, 2016. (DOI).
- Efficient Self-Invalidation/Self-Downgrade for Critical Sections with Relaxed Semantics. In Proc. International Conference on Parallel Architectures and Compilation: PACT 2016, pp 433-434, ACM Press, New York, 2016. (DOI).
- Fencing programs with self-invalidation and self-downgrade. In Formal Techniques for Distributed Objects, Components, and Systems, volume 9688 of Lecture Notes in Computer Science, pp 19-35, Springer, 2016. (DOI).
- Formalizing data locality in task parallel applications. In Algorithms and Architectures for Parallel Processing, volume 10049 of Lecture Notes in Computer Science, pp 43-61, Springer, 2016. (DOI).
- Multiversioned decoupled access-execute: The key to energy-efficient compilation of general-purpose programs. In Proc. 25th International Conference on Compiler Construction, pp 121-131, ACM Press, New York, 2016. (DOI, fulltext:print).
- Partitioning GPUs for Improved Scalability. In Proc. 28th International Symposium on Computer Architecture and High Performance Computing, International Symposium on Computer Architecture and High Performance Computing, pp 42-49, IEEE Computer Society, 2016. (DOI).
- Practical way halting by speculatively accessing halt tags. In Proc. 19th Conference on Design, Automation and Test in Europe, pp 1375-1380, IEEE, Piscataway, NJ, 2016.
- Profiling-Assisted Decoupled Access-Execute. In Proc. 4th International Workshop on High Performance Energy Efficient Embedded Systems, 2016. (External link).
- Racer: TSO Consistency via Race Detection. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), International Symposium on Microarchitecture Proceedings, 2016.
- Redesigning a tagless access buffer to require minimal ISA changes. In Proc. 19th International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2016. (DOI).
- Spatial and Temporal Cache Sharing Analysis in Tasks. Timisoara, Romania, 2016. (Proceedings, fulltext:print).
- Splash-3: A properly synchronized benchmark suite for contemporary research. In Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2016, pp 101-111, IEEE Computer Society, 2016. (DOI).
- Techniques for modulating error resilience in emerging multi-value technologies. In Proc. 13th International Conference on Computing Frontiers, pp 55-63, ACM Press, New York, 2016. (DOI, fulltext:postprint).
2015
- A dual-consistency cache coherence protocol. In Proc. 29th International Parallel and Distributed Processing Symposium, pp 1119-1128, IEEE Computer Society, Los Alamitos, CA, 2015. (DOI, fulltext:print).
- AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance. In Proc. 24th International Conference on Parallel Architectures and Compilation Techniques, pp 367-378, IEEE Computer Society, 2015. (DOI, fulltext:postprint).
- An efficient, self-contained, on-chip directory: DIR₁-SISD. In Proc. 24th International Conference on Parallel Architectures and Compilation Techniques, pp 317-330, IEEE Computer Society, 2015. (DOI).
- Callback: Efficient Synchronization without Invalidation with a Directory Just for Spin-Waiting. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp 427-438, 2015. (DOI).
- Cost-effective speculative scheduling in high performance processors. In Proc. 42nd International Symposium on Computer Architecture, pp 247-259, ACM Press, New York, 2015. (DOI).
- Effects of Granularity/Adaptivity on Private/Shared Classification for Coherence. 2015.
- Full speed ahead: Detailed architectural simulation at near-native speed. In Proc. 18th International Symposium on Workload Characterization, pp 183-192, IEEE Computer Society, 2015. (DOI).
- Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies. In Proc. 21st International Symposium on High Performance Computer Architecture, pp 186-197, IEEE Computer Society Digital Library, 2015. (DOI).
- Improving data access efficiency by using context-aware loads and stores. In Proc. 16th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp 27-36, ACM Press, New York, 2015. (DOI).
- Long Term Parking (LTP): Criticality-aware Resource Allocation in OOO Processors. In Proc. 48th International Symposium on Microarchitecture, pp 334-346, 2015. (DOI).
- Micro-Architecture Independent Analytical Processor Performance and Power Modeling. In 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE International Symposium on Performance Analysis of Systems and Software-ISPASS, pp 32-41, 2015.
- Optimizing transfers of control in the static pipeline architecture. In Proc. 16th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp 7-16, ACM Press, New York, 2015. (DOI).
- Perf-Insight: A Simple, Scalable Approach to Optimal Data Prefetching in Multicores. Technical report / Department of Information Technology, Uppsala University nr 2015-037, 2015. (fulltext).
- Scheduling instruction effects for a statically pipelined processor. In Proc. International Conference on Compilers, Architectures, and Synthesis for Embedded Systems: CASES 2015, pp 167-176, IEEE Press, Piscataway, NJ, 2015. (DOI).
- StatTask: Reuse distance analysis for task-based applications. In Proc. 7th Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, pp 1-7, ACM Press, New York, 2015. (DOI).
- The Load Slice Core Microarchitecture. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp 272-284, 2015. (DOI).
- The effects of granularity and adaptivity on private/shared classification for coherence. In ACM Transactions on Architecture and Code Optimization (TACO), volume 12, number 3, 2015. (DOI).
2014
- A case for resource efficient prefetching in multicores. In Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2014, pp 137-138, IEEE Computer Society, 2014. (DOI).
- A case for resource efficient prefetching in multicores. In Proc. 43rd International Conference on Parallel Processing, pp 101-110, IEEE Computer Society, 2014. (DOI).
- A software based profiling method for obtaining speedup stacks on commodity multi-cores. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): ISPASS 2014, IEEE International Symposium on Performance Analysis of Systems and Software-ISPASS, pp 148-157, IEEE Computer Society, 2014. (DOI).
- A tunable cache for approximate computing. In Proc. 10th International Symposium on Nanoscale Architectures, IEEE International Symposium on Nanoscale Architectures, pp 88-89, IEEE, Piscataway, NJ, 2014. (DOI).
- Dynamic and speculative polyhedral parallelization using compiler-generated skeletons. In International journal of parallel programming, volume 42, number 4, pp 529-545, 2014. (DOI).
- Extending statistical cache models to support detailed pipeline simulators. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE International Symposium on Performance Analysis of Systems and Software-ISPASS, pp 86-95, IEEE Computer Society, 2014. (DOI).
- Fix the code. Don't tweak the hardware: A new compiler approach to Voltage–Frequency scaling. In Proc. 12th International Symposium on Code Generation and Optimization, pp 262-272, ACM Press, New York, 2014. (URL, fulltext:postprint).
- Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed. Technical report / Department of Information Technology, Uppsala University nr 2014-005, 2014. (External link, fulltext).
- Managing power constraints in a single-core scenario through power tokens. In Journal of Supercomputing, volume 68, number 1, pp 414-442, 2014. (DOI).
- Power-Efficient Computer Architectures: Recent Advances. Morgan & Claypool Publishers, 2014. (DOI).
- Resource conscious prefetching for irregular applications in multicores. In Proc. International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), pp 34-43, IEEE, Piscataway, NJ, 2014. (DOI).
- Software-controlled processor stalls for time and energy efficient data locality optimization. In Proc. International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), pp 199-206, IEEE, Piscataway, NJ, 2014. (DOI, fulltext:postprint).
- Speculative program parallelization with scalable and decentralized runtime verification. In Runtime Verification, volume 8734 of Lecture Notes in Computer Science, pp 124-139, Springer Berlin/Heidelberg, 2014. (DOI, fulltext:postprint).
- The Direct-to-Data (D2D) Cache: Navigating the cache hierarchy with a single lookup. In Proc. 41st International Symposium on Computer Architecture, pp 133-144, IEEE Press, Piscataway, NJ, 2014. (DOI).
- The Effects of Granularity and Adaptivity on Private/Shared Classification for Coherence. 2014.
2013
- A New Perspective for Efficient Virtual-Cache Coherence. In Proceedings of the 40th Annual International Symposium on Computer Architecture, pp 535-546, 2013. (DOI).
- Bandwidth Bandit: Quantitative Characterization of Memory Contention. In Proc. 11th International Symposium on Code Generation and Optimization: CGO 2013, pp 99-108, IEEE Computer Society, 2013. (DOI).
- Dynamic and speculative polyhedral parallelization of loop nests using binary code patterns. In ICCS 2013, volume 18 of Procedia Computer Science, pp 2575-2578, 2013. (DOI, fulltext:postprint).
- Efficient inter-core power and thermal balancing for multicore processors. In Computing, volume 95, number 7, pp 537-566, 2013. (DOI).
- Introducing DVFS-Management in a Full-System Simulator. In Proc. 21st International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, IEEE Computer Society, 2013.
- Modeling performance variation due to cache sharing. In Proc. 19th IEEE International Symposium on High Performance Computer Architecture, pp 155-166, IEEE Computer Society, 2013. (DOI, fulltext:postprint).
- Online dynamic dependence analysis for speculative polyhedral parallelization. In Euro-Par 2013 Parallel Processing, volume 8097 of Lecture Notes in Computer Science, pp 191-202, Springer Berlin/Heidelberg, 2013. (DOI, fulltext:postprint).
- Shared Resource Sensitivity in Task-Based Runtime Systems. In Proc. 6th Swedish Workshop on Multi-Core Computing, Halmstad University Press, 2013. (fulltext:postprint).
- System and method for data classification and efficient virtual cache coherence without reverse translation. 2013.
- TLC: A tag-less cache for reducing dynamic first level cache energy. In Proceedings of the 46th International Symposium on Microarchitecture, pp 49-61, ACM Press, New York, 2013. (DOI, Conference website).
- Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models. In PARMA 2013, 4th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures, 2013. (Conference website, fulltext:postprint).
- Towards more efficient execution: a decoupled access-execute approach. In Proc. 27th ACM International Conference on Supercomputing, pp 253-262, ACM Press, New York, 2013. (DOI, fulltext:print).
2012
- Bandwidth bandit: Quantitative characterization of memory contention. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pp 457-458, 2012. (DOI).
- Complexity-effective multicore coherence. In Proc. 21st International Conference on Parallel Architectures and Compilation Techniques, pp 241-251, ACM Press, New York, 2012. (DOI).
- Efficient techniques for predicting cache sharing and throughput. In Proc. 21st International Conference on Parallel Architectures and Compilation Techniques, pp 305-314, ACM Press, New York, 2012. (DOI, fulltext:postprint).
- Low Overhead Instruction-Cache Modeling Using Instruction Reuse Profiles. In International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'12), Computer Architecture and High Performance Computing, pp 260-269, IEEE Computer Society, 2012. (DOI).
- Phase Behavior in Serial and Parallel Applications. In International Symposium on Workload Characterization (IISWC'12), IEEE Computer Society, 2012.
- Phase Guided Profiling for Fast Cache Modeling. In International Symposium on Code Generation and Optimization (CGO'12), pp 175-185, ACM Press, 2012. (DOI).
- Power-Sleuth: A Tool for Investigating your Program's Power Behavior. In International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'12), pp 241-250, 2012. (DOI).
- Quantitative Characterization of Memory Contention. Technical report / Department of Information Technology, Uppsala University nr 2012-029, Uppsala universitet, Uppsala, 2012. (on department web, fulltext).
2011
- A simple model for tuning tasks. In Proc. 4th Swedish Workshop on Multi-Core Computing, pp 45-49, Linköping University, Linköping, Sweden, 2011.
- A simple statistical cache sharing model for multicores. In Proc. 4th Swedish Workshop on Multi-Core Computing, pp 31-36, Linköping University, Linköping, Sweden, 2011. (fulltext:postprint).
- Cache Pirating: Measuring the curse of the shared cache. Technical report / Department of Information Technology, Uppsala University nr 2011-001, 2011. (fulltext).
- Cache Pirating: Measuring the Curse of the Shared Cache. In Proc. 40th International Conference on Parallel Processing, pp 165-175, IEEE Computer Society, 2011. (DOI).
- Computing Systems: Research Challenges Ahead: The HiPEAC Vision 2011/2012. 2011. (PDF, fulltext).
- Efficient software-based online phase classification. In International Symposium on Workload Characterization (IISWC'11), pp 104-115, IEEE Computer Society, 2011. (DOI).
- Fast modeling of shared caches in multicore systems. In Proc. 6th International Conference on High Performance and Embedded Architectures and Compilers, pp 147-157, ACM Press, New York, 2011. (DOI).
- Green governors: A framework for continuously adaptive DVFS. In Proc. International Green Computing Conference and Workshops: IGCC 2011, pp 1-8, IEEE, Piscataway, NJ, 2011. (DOI).
- Leakage-efficient design of value predictors through state and non-state preserving techniques. In Journal of Supercomputing, volume 55, number 1, pp 28-50, 2011. (DOI).
- Power Token Balancing: Adapting CMPs to power constraints for parallel multithreaded workloads. In Proc. 25th International Parallel and Distributed Processing Symposium, pp 431-442, IEEE, Piscataway, NJ, 2011. (DOI).
- Power-performance adaptation in Intel Core i7. In Proc. 2nd Workshop on Computer Architecture and Operating System co-design, p 10, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, 2011. (Proceedings).
- Using hardware transactional memory for high-performance computing. In Proc. 25th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pp 1660-1667, IEEE, Piscataway, NJ, 2011. (DOI).
2010
- A Software Technique for Reducing Cache Pollution. In Proc. 3rd Swedish Workshop on Multi-Core Computing, pp 59-62, Chalmers University of Technology, Göteborg, Sweden, 2010. (fulltext:postprint).
- Block-Parallel Programming for Real-time Embedded Applications. In Proc. 39th International Conference on Parallel Processing, pp 297-306, IEEE, Piscataway, NJ, 2010. (DOI, fulltext:postprint).
- Efficient cache modeling with sparse data. In Processor and System-on-Chip Simulation, pp 193-209, Springer, New York, 2010. (DOI).
- Interval-based models for run-time DVFS orchestration in superscalar processors. In Proc. 7th International Conference on Computing Frontiers, pp 287-296, ACM Press, New York, 2010. (DOI).
- MLP-aware instruction queue resizing: The key to power-efficient performance. In Architecture of Computing Systems – ARCS 2010, volume 5974 of Lecture Notes in Computer Science, pp 113-125, Springer-Verlag, Berlin, 2010. (DOI).
- Parallelizing multicore cache simulations on GPUs. In Proc. 3rd Swedish Workshop on Multi-Core Computing, pp 3-8, Chalmers University of Technology, Göteborg, Sweden, 2010.
- Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis: SC 2010, p 11, IEEE, Piscataway, NJ, 2010. (DOI, fulltext:print).
- SARC coherence: Scaling directory cache coherence in performance and power. In IEEE Micro, volume 30, number 5, pp 54-65, 2010. (DOI).
- StatCC: a statistical cache contention model. In Proc. 19th International Conference on Parallel Architectures and Compilation Techniques, pp 551-552, ACM Press, New York, 2010. (DOI).
- StatStack: Efficient modeling of LRU caches. In Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2010, pp 55-65, IEEE, Piscataway, NJ, 2010. (DOI).
- Where replacement algorithms fail: a thorough analysis. In Proc. 7th International Conference on Computing Frontiers, pp 141-150, ACM Press, New York, 2010. (DOI).
Older Publications
- Reconsidering algorithms for iterative solvers in the multicore era. In International Journal of Computational Science and Engineering, volume 4, pp 270-282, 2009. (DOI).
- Improving cache utilization using Acumem VPE. In Tools for High Performance Computing, pp 115-135, Springer-Verlag, Berlin, 2008. (DOI).
- A case for low-complexity MP architectures. In Proc. Conference on Supercomputing: SC 2007, pp 559-570, ACM Press, New York, 2007. (DOI).
- Computer system employing bundled prefetching. US, 2007.
- Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution. In 21st International Parallel and Distributed Processing Symposium, 2007.
- Multiprocessing computer system employing capacity prefetching. US, 2007.
- A Statistical Multiprocessor Cache Model. In Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2006, pp 89-99, IEEE, Piscataway, NJ, 2006. (DOI).
- Computer system including a promise array. US, 2006.
- Exploiting Locality: A Flexible DSM Approach. In Proc. 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2006), Rhodes, Greece, April 2006.
- Modeling cache sharing on chip multiprocessor architectures. In Proc. International Symposium on Workload Characterization: IISWC 2006, pp 160-171, IEEE, Piscataway, NJ, 2006. (DOI).
- Multigrid and Gauss-Seidel smoothers revisited: Parallelization on chip multiprocessors. Technical report / Department of Information Technology, Uppsala University nr 2006-018, 2006. (fulltext).
- Multigrid and Gauss-Seidel smoothers revisited: Parallelization on chip multiprocessors. In Proc. 20th ACM International Conference on Supercomputing, pp 145-155, ACM Press, New York, 2006. (DOI).
- Multiprocessing computer system employing capacity prefetching. US, 2006.
- Multiprocessing systems employing hierarchical back-off locks. US, 2006.
- System and method for reducing shared memory write overhead in multiprocessor systems. US, 2006.
- TMA: A Trap-based Memory Architecture. In Proc. 20th ACM International Conference on Supercomputing, pp 259-268, 2006.
- Adaptive Coherence Batching for Trap-Based Memory Architectures. Information Technology - Technical reports nr 2005-016, Uppsala universitet, Department of Information Technology, 2005. (External link).
- Exploring Processor Design Options for Java Based Middleware. In Proceedings of the 2005 International Conference on Parallel Processing (ICPP-05), 2005.
- Fast Data-Locality Profiling of Native Execution. In ACM SIGMETRICS Performance Evaluation Review, volume 33, number 1, pp 169-180, 2005. (DOI).
- Flexibility Implies Performance. Information Technology - Technical reports nr 2005-013, Uppsala universitet, Department of Information Technology, 2005. (External link).
- Multi-node computer system employing multiple memory response states. 2005.
- Multi-node computer system where active devices selectively initiate certain transactions using remote-type address packets. 2005.
- Multi-node system in which home memory subsystem stores global to local address translation information for replicating nodes. 2005.
- Parallella program ger paradigmskifte. In Elektroniktidningen, number 13, 2005.
- Skewed Caches from a Low-Power Perspective. In Proceedings of Computing Frontiers, Ischia, Italy, May 2005.
- TMA: A Trap-Based Memory Architecture. Information Technology - Technical reports nr 2005-015, Uppsala universitet, Department of Information Technology, 2005. (External link).
- Vasa: A Simulator Infrastructure with Adjustable Fidelity. In Proceedings of the 17th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2005), Phoenix, Arizona, USA, November 2005. (External link).
- Bundling: Reducing the Overhead of Multiprocessor Prefetchers. In 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004.
- Computer system employing bundled prefetching. 2004.
- Computer system including a promise array. 2004.
- Evaluation, Implementation and Performance of Write Permission Caching in the DSZOOM System. Technical reports from the Department of Information Technology nr 2004-005, 2004. (External link).
- Exploiting Spatial Store Locality through Permission Caching in Software DSMs. In Proceedings of the 10th International Euro-Par Conference: Parallel Processing, p 551, 2004. (External link).
- Low Power and Conflict Tolerant Cache Design. Technical report / Department of Information Technology, Uppsala University nr 2004-024, 2004. (fulltext).
- Multi-node computer system employing a reporting mechanism for multi-node transactions. 2004.
- Multi-node computer system implementing global access state dependent transactions. 2004.
- Multi-node computer system with proxy transaction to read data from a non-owning memory device. 2004.
- Multi-node system in which global address generated by processing subsystem includes global to local translation information. 2004.
- Multi-node system with global access states. 2004.
- Multi-node system with interface intervention to satisfy coherency transactions transparently to active devices. 2004.
- Multi-node system with split ownership and access right coherence mechanism. 2004.
- Multiprocessing computer system employing capacity prefetching. 2004.
- Multiprocessing systems employing hierarchical back-off locks. 2004.
- Performing virtual to global address translation in processing subsystem. 2004.
- StatCache: A Probabilistic Approach to Efficient and Accurate Data Locality Analysis. In 2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2004), 2004.
- System and method for reducing shared memory write overhead in multiprocessor systems. 2004.
- Bundling: Reducing the Overhead of Multiprocessor Prefetchers. IT Technical Report 2003-037, Uppsala: Department of Information Technology, Uppsala University, 2003. (External link).
- Communication error reporting mechanism in a multiprocessing computer system. 2003.
- Hierarchical Backoff Locks for Nonuniform Communication Architectures. In Proceedings of the Ninth International Symposium on High Performance Computer Architecture (HPCA-9), Anaheim, California, USA, February 2003. (External link).
- Latency-hiding and Optimizations of the DSZOOM Instrumentation System. IT Technical Report 2003-029, Uppsala: Department of Information Technology, Uppsala University, 2003. (External link).
- Low-Overhead Spatial and Temporal Data Locality Analysis. Technical report / Department of Information Technology, Uppsala University nr 2003-057, Uppsala: Department of Information Technology, Uppsala University, 2003. (fulltext).
- Memory System Behavior of Java-Based Middleware. In Proceedings of the Ninth International Symposium on High Performance Computer Architecture, 2003. (External link).
- Methods and apparatus for a directory-less memory access protocol in a distributed shared memory computer system. 2003.
- Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors. In Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), Nice, France, 2003.
- Multiprocessing systems employing hierarchical spin locks. 2003.
- Queuing delay limiter. 2003.
- Selective address translation in coherent memory replication. 2003.
- StatCache: A Probabilistic Approach to Efficient and Accurate Data Locality Analysis. Technical report / Department of Information Technology, Uppsala University nr 2003-058, Uppsala: Department of Information Technology, Uppsala University, 2003. (fulltext).
- System and method for accessing a shared computer resource using a lock featuring different spin speeds corresponding to multiple states. 2003.
- THROOM: Running POSIX Multithreaded Binaries on a Cluster. Technical report / Department of Information Technology, Uppsala University nr 2003-026, 2003. (fulltext).
- THROOM — Supporting POSIX Multithreaded Binaries on a Cluster. In Euro-Par 2003: Parallel Processing, volume 2790 of Lecture Notes in Computer Science, pp 760-769, Springer-Verlag, Berlin, 2003. (DOI).
- Timestamp-based Selective Cache Allocation. In High Performance Memory Systems, Springer-Verlag, 2003.
- The Elbow Cache: A Power-Efficient Alternative to Highly Associative Caches. Technical report / Department of Information Technology, Uppsala University nr 2003-046, Uppsala: Department of Information Technology, Uppsala University, 2003. (fulltext).
- Efficient Synchronization for Non-Uniform Communication Architectures. In Proceedings of Supercomputing 2002, Baltimore, Maryland, USA, 2002. (External link).
- Hierarchical SMP computer System. 2002.
- Hybrid memory access protocol in a distributed shared memory computer system. 2002.
- Methods and apparatus for a directory-less memory access protocol in a distributed shared memory computer system. 2002.
- RH Lock: A Scalable Hierarchical Spin Lock. In Proceedings of the 2nd Annual Workshop on Memory Performance Issues (WMPI 2002), held in conjunction with the 29th International Symposium on Computer Architecture (ISCA29), Anchorage, Alaska, USA, 2002. (External link).
- Selective address translation in coherent memory replication. 2002.
- Skewed finite hashing function. 2002.
- Cache-less address translation. 2001.
- Communication error reporting mechanism in a multiprocessing computer system. 2001.
- Communication error reporting mechanism in a multiprocessing computer system. 2001.
- Hybrid memory access protocol in a distributed shared memory computer system. 2001.
- Multiprocessing system configured to perform efficient block copy operations. 2001.
- Multiprocessing system configured to perform efficient block copy operations. 2001.
- Multiprocessing system configured to perform efficient block copy operations. 2001.
- Multiprocessor computer system employing a mechanism for routing communication traffic through a cluster node having a slice of memory direct. 2001.
- Selective address translation in coherent memory replication. 2001.
- Shared memory system for symmetric microprocessor systems. 2001.
- Shared memory system for symmetric multiprocessor systems. 2001.
- Skewed finite hashing function. 2001.
- Skewed finite hashing function. 2001.
- Directory-based, shared-memory, scaleable multiprocessor computer system having deadlock-free transaction flow sans flow control protocol. 2000.
- Hybrid queue and backoff computer resource lock featuring different spin speeds corresponding to multiple-states. 2000.
- Method for increasing the speed of data processing in a computer system. 2000.
- WildFire: A Scalable Path for SMPs. In Proc. Fifth Int. Symp. on High-Performance Computer Architecture, pp 172-181, 1999. (DOI, External link).
Please respect the copyrights. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage. To copy otherwise, or to republish, requires a fee and/or specific permission of the authors and/or ACM/IEEE.
2006
- Multigrid and Gauss-Seidel smoothers revisited: Parallelization on chip multiprocessors by Dan Wallin, Henrik Löf, Erik Hagersten, and Sverker Holmgren. In Proc. 20th ACM International Conference on Supercomputing, ACM Press, New York, pp 145-155, 2006.
- STATSHARE: A Statistical Model for Managing Cache Sharing via Decay by Pavlos Petoumenos, Georgios Keramidas, Håkan Zeffer, Stefanos Kaxiras, and Erik Hagersten. In 2006 Workshop on Modeling, Benchmarking and Simulation, held in conjunction with the 33rd Annual International Symposium on Computer Architecture, Boston, MA, USA, June 2006.
- Modeling Cache Sharing on Chip Multiprocessor Architectures by Pavlos Petoumenos, Georgios Keramidas, Håkan Zeffer, Erik Hagersten, and Stefanos Kaxiras. In Proceedings of the 2006 IEEE International Symposium on Workload Characterization, San Jose, California, USA, 2006.
- Towards Low-Complexity Scalable Shared-Memory Architectures by Håkan Zeffer. PhD thesis, Department of Information Technology, Uppsala University, October 2006. ISBN 91-554-6647-8.
- Iterative and Adaptive PDE Solvers for Shared Memory Architectures by Henrik Löf. PhD thesis, Department of Information Technology, Uppsala University, October 2006. ISBN 91-554-6648-6.
- Memory System Behavior of Java-Based Middleware by Martin Karlsson, Kevin E. Moore, Erik Hagersten, and David A. Wood. In Hans Hansson, editor, ARTES - A network for Real-Time research and graduate Education in Sweden 1997-2006, volume 2006-006 of Technical reports from the Department of Information Technology, Uppsala University, The Department of Information Technology, Uppsala, p 830, 2006.
- TMA: A Trap-Based Memory Architecture by Håkan Zeffer, Zoran Radovic, Martin Karlsson, and Erik Hagersten. In Proceedings of the 20th ACM International Conference on Supercomputing (ICS 2006), Cairns, Queensland, Australia, June 2006.
- A Case For Low-Complexity Multi-CMP Architectures by Håkan Zeffer and Erik Hagersten. Technical report 2006-031, Department of Information Technology, Uppsala University, June 2006.
- Methods for Creating and Exploiting Data Locality by Dan Wallin. PhD thesis, Department of Information Technology, Uppsala University, May 2006. ISBN 91-554-6555-2.
- Exploiting Locality: A Flexible DSM Approach by Håkan Zeffer, Zoran Radovic, and Erik Hagersten. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, April 2006.
- A Statistical Multiprocessor Cache Model by Erik Berg, Håkan Zeffer, and Erik Hagersten. In Proceedings of the 2006 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2006), Austin, Texas, USA, March 2006.
- Memory System Design for Chip-Multiprocessors by Martin Karlsson. PhD thesis, Department of Information Technology, Uppsala University, January 2006. ISBN 91-554-6429-7.
2005
- Software Techniques for Distributed Shared Memory by Zoran Radovic. PhD thesis, Department of Information Technology, Uppsala University, November 2005.
- Vasa: A Simulator Infrastructure with Adjustable Fidelity by Dan Wallin, Håkan Zeffer, Martin Karlsson, and Erik Hagersten. In Proceedings of the 17th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2005), Phoenix, Arizona, USA, November 2005.
- Efficient and Flexible Characterization of Data Locality through Native Execution Sampling by Erik Berg. PhD thesis, Department of Information Technology, Uppsala University, November 2005.
- Exploring Processor Design Options for Java Based Middleware by Martin Karlsson, Kevin Moore, Erik Hagersten, and David Wood. In Proceedings of the 2005 International Conference on Parallel Processing (ICPP-05), Oslo, Norway, June 2005.
- Fast Data-Locality Profiling of Native Execution by Erik Berg and Erik Hagersten. In Proceedings of ACM SIGMETRICS 2005, Banff, Canada, June 2005.
- Hardware-Software Tradeoffs in Shared-Memory Implementations by Håkan Zeffer. Licentiate Thesis 2005-002, Department of Information Technology, Uppsala University, May 2005.
- Adaptive Coherence Batching for Trap-Based Memory Architectures by Håkan Zeffer and Erik Hagersten. Technical report 2005-016, Department of Information Technology, Uppsala University, May 2005.
- Skewed Caches from a Low-Power Perspective by Mathias Spjuth, Martin Karlsson and Erik Hagersten. In Proceedings of Computing Frontiers, Ischia, Italy, May 2005.
- TMA: A Trap-Based Memory Architecture by Håkan Zeffer, Zoran Radovic, Martin Karlsson, and Erik Hagersten. Technical report 2005-015, Department of Information Technology, Uppsala University, May 2005.
- Flexibility Implies Performance by Håkan Zeffer, Zoran Radovic, and Erik Hagersten. Technical report 2005-013, Department of Information Technology, Uppsala University, April 2005.
2004
- Reorganisation in the Skewed-Associative TLB by Thorild Selén. Technical report 2004-027, Department of Information Technology, Uppsala University, September 2004. (Master's thesis)
- Exploiting Spatial Store Locality through Permission Caching in Software DSMs by Håkan Zeffer, Zoran Radovic, Oskar Grenholm, and Erik Hagersten. In Proceedings of the 10th International Euro-Par Conference (Euro-Par 2004), Pisa, Italy, August 2004.
- Low Power and Conflict Tolerant Cache Design by Mathias Spjuth, Martin Karlsson, and Erik Hagersten. Technical report 2004-024, Department of Information Technology, Uppsala University, May 2004.
- Bundling: Reducing the Overhead of Multiprocessor Prefetchers by Dan Wallin and Erik Hagersten. In Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), Santa Fe, New Mexico, USA, April 2004.
- StatCache: A Probabilistic Approach to Efficient and Accurate Data Locality Analysis by Erik Berg and Erik Hagersten. In Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2004), Austin, Texas, USA, March 2004.
- Evaluation, Implementation and Performance of Write Permission Caching in the DSZOOM System by Håkan Zeffer, Zoran Radovic, Oskar Grenholm, and Erik Hagersten. Technical report 2004-005, Department of Information Technology, Uppsala University, February 2004.
- Improving DSZOOM's Run Time System by Niklas Ekström. Master's thesis, UPTEC F03 104, ISSN 1401-5757, School of Engineering, Uppsala University, Sweden, January 2004.
2003
- Methods for Run Time Analysis of Data Locality by Erik Berg. Licentiate Thesis 2003-015, Department of Information Technology, Uppsala University, December 2003.
- StatCache: A Probabilistic Approach to Efficient and Accurate Data Locality Analysis by Erik Berg and Erik Hagersten. Technical report 2003-058, Department of Information Technology, Uppsala University, November 2003.
- Low-Overhead Spatial and Temporal Data Locality Analysis by Erik Berg and Erik Hagersten. Technical report 2003-057, Department of Information Technology, Uppsala University, November 2003.
- Exploiting Data Locality in Adaptive Architectures by Dan Wallin. Licentiate Thesis 2003-010, Department of Information Technology, Uppsala University, September 2003.
- Cache Memory Design Trade-offs for Current and Emerging Workloads by Martin Karlsson. Licentiate Thesis 2003-009, Department of Information Technology, Uppsala University, September 2003.
- Efficient Synchronization and Coherence for Nonuniform Communication Architectures by Zoran Radovic. Licentiate Thesis 2003-008, Department of Information Technology, Uppsala University, September 2003.
- The Elbow Cache: A Power-Efficient Alternative to Highly Associative Caches by Mathias Spjuth, Martin Karlsson, and Erik Hagersten. Technical report 2003-046, Department of Information Technology, Uppsala University, September 2003.
- Cache Memory Behavior of Advanced PDE Solvers by Dan Wallin, Henrik Johansson, and Sverker Holmgren. In Proceedings of Parallel Computing 2003 (ParCo2003), Dresden, Germany, September 2003.
- Cache Memory Behavior of Advanced PDE Solvers by Dan Wallin, Henrik Johansson, and Sverker Holmgren. Technical Report 2003-044, Department of Information Technology, Uppsala University, August 2003.
- Bundling: Reducing the Overhead of Multiprocessor Prefetchers by Dan Wallin and Erik Hagersten. Technical Report 2003-037, Department of Information Technology, Uppsala University, August 2003.
- THROOM - Supporting POSIX Multithreaded Binaries on a Cluster by Henrik Löf, Zoran Radovic, and Erik Hagersten. In Proceedings of the 9th International Euro-Par Conference (Euro-Par 2003), Klagenfurt, Austria, August 2003.
- Latency-hiding and Optimizations of the DSZOOM Instrumentation System by Oskar Grenholm, Zoran Radovic, and Erik Hagersten. Technical Report 2003-029, Department of Information Technology, Uppsala University, May 2003.
- THROOM - Running POSIX Multithreaded Binaries on a Cluster by Henrik Löf, Zoran Radovic, and Erik Hagersten. Technical Report 2003-026, Department of Information Technology, Uppsala University, April 2003.
- Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors by Dan Wallin and Erik Hagersten. In Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), Nice, France, April 2003.
- Performance of PDE Solvers on a Self-Optimizing NUMA Architecture by Sverker Holmgren, Jarmo Rantakokko, Markus Norden, and Dan Wallin. In Journal of Parallel Algorithms and Applications, vol. 17, no. 4, pp. 285-299, 2003.
- Memory System Behavior of Java-Based Middleware by Martin Karlsson, Kevin Moore, Erik Hagersten, and David Wood. In Proceedings of the Ninth International Symposium on High Performance Computer Architecture (HPCA-9), Anaheim, California, USA, February 2003.
- Hierarchical Backoff Locks for Nonuniform Communication Architectures by Zoran Radovic and Erik Hagersten. In Proceedings of the Ninth International Symposium on High Performance Computer Architecture (HPCA-9), Anaheim, California, USA, February 2003.
2002
- Simple and Efficient Instrumentation for the DSZOOM System by Oskar Grenholm. Master's thesis, UPTEC F-02-096, ISSN 1401-5757, School of Engineering, Uppsala University, Sweden, December 2002.
- Efficient Synchronization for Nonuniform Communication Architectures by Zoran Radovic and Erik Hagersten. In Proceedings of Supercomputing 2002 (SC2002), Baltimore, Maryland, USA, November 2002.
- SIP: Performance Tuning through Source Code Interdependence by Erik Berg and Erik Hagersten. In Proceedings of the 8th International Euro-Par Conference (Euro-Par 2002), Paderborn, Germany, August 2002.
- Memory Characterization of the ECperf Benchmark by Martin Karlsson, Kevin Moore, Erik Hagersten, and David Wood. In Proceedings of the 2nd Annual Workshop on Memory Performance Issues (WMPI 2002), held in conjunction with the 29th International Symposium on Computer Architecture (ISCA29), Anchorage, Alaska, USA, May 2002.
- RH Lock: A Scalable Hierarchical Spin Lock by Zoran Radovic and Erik Hagersten. In Proceedings of the 2nd Annual Workshop on Memory Performance Issues (WMPI 2002), held in conjunction with the 29th International Symposium on Computer Architecture (ISCA29), Anchorage, Alaska, USA, May 2002.
- Refinement and Evaluation of the Elbow Cache by Mathias Spjuth. Master's thesis, UPTEC F-02-033, ISSN 1401-5757, School of Engineering, Uppsala University, Sweden, April 2002.
- Temporal Debugging and Profiling of Multimedia Applications by Lars Albertsson. In Proceedings of Multimedia Computing and Networking 2002, San José, California, USA, January 2002.
2001
- Removing the Overhead from Software-Based Shared Memory by Zoran Radovic and Erik Hagersten. In Proceedings of Supercomputing 2001 (SC2001), Denver, Colorado, USA, November 2001.
- Performance of a High-Accuracy PDE Solver on a Self Optimizing NUMA Architecture by Sverker Holmgren and Dan Wallin. In Proceedings of the 7th International Euro-Par Conference (Euro-Par 2001), Manchester, UK, August 2001.
- Timestamp-Based Selective Cache Allocation by Martin Karlsson and Erik Hagersten. In High Performance Memory Systems, edited by H. Hadimioglu, D. Kaeli, J. Kuskin, A. Nanda, and J. Torrellas, Springer-Verlag, 2003. Also published in Proceedings of the Workshop on Memory Performance Issues (WMPI 2001), held in conjunction with the 28th International Symposium on Computer Architecture (ISCA28), Göteborg, Sweden, June 2001.
- Implementing Low Latency Distributed Software-Based Shared Memory by Zoran Radovic and Erik Hagersten. In Proceedings of the Workshop on Memory Performance Issues (WMPI 2001), held in conjunction with the 28th International Symposium on Computer Architecture (ISCA28), Göteborg, Sweden, June 2001.
- Simulation-Based Debugging of Soft Real-Time Applications by Lars Albertsson. In Proceedings of the Real-Time Application Symposium, IEEE Computer Society, IEEE Computer Society Press, May 2001.
- DSZOOM - Low Latency Software-Based Shared Memory by Zoran Radovic and Erik Hagersten. Technical Report 2001:03, Parallel and Scientific Computing Institute (PSCI), Sweden, April 2001.
- Performance of a High-Accuracy PDE Solver on a Self Optimizing NUMA Architecture by Dan Wallin. Master's thesis, UPTEC F-01-017, ISSN 1401-5757, School of Engineering, Uppsala University, Sweden, February 2001.
2000
- DSZOOM - Low Latency Software-Based Shared Memory by Zoran Radovic. Master's thesis, UPTEC F-00-093, ISSN 1401-5757, School of Engineering, Uppsala University, Sweden, December 2000.
- Simulation-Based Temporal Debugging of Linux by Lars Albertsson and Peter S. Magnusson. In Proceedings of the 2nd Real-Time Linux Workshop, Lake Buena Vista, Florida, USA, November 2000.
- Using Complete System Simulation for Temporal Debugging of General Purpose Operating Systems and Workloads by Lars Albertsson and Peter S. Magnusson. In Proceedings of the 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2000), San Francisco, California, USA, August 2000.
- Shared-Memory Multiprocessing: Current State and Future Directions by P. Stenström, E. Hagersten, D. Lilja, M. Martonosi, and M. Venugopal. In Advances in Computers, Marvin Zelkowitz (editor), Academic Press, Vol. 53, pages 2-46, 2000.
1999
- Scanning the DSM Technology by Erik Hagersten and Greg Papadopoulos. In Proceedings of the IEEE, March 1999.
- WildFire: A Scalable Path for SMPs by Erik Hagersten and Michael Koster. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture (HPCA-5), pages 172-181, Orlando, Florida, USA, January 1999.
Some of Erik's Old Papers
- Gigaplane: A High Performance Bus for Large SMPs by Ashok Singhal, David Broniarczyk, Frederick Cerauskis, Jeff Price, Leo Yuan, Chris Cheng, Drew Doblar, Steve Fosth, Nalini Agarwal, Kenneth Harvey, and Erik Hagersten. In Proceedings of the IEEE Hot Interconnect IV, pages 41-52, Stanford University, August 1996.
- Queue Locks on Cache Coherent Multiprocessors by Peter S. Magnusson, Anders Landin, and Erik Hagersten. In Proceedings of the IPPS, Cancun, Mexico, January 1994. (The original LH lock paper.)
- Simple COMA Node Implementations by Erik Hagersten, Ashley Saulsbury, and Anders Landin. In Proceedings of the Hawaii International Conference on System Sciences (HICSS), January 1994. (The original Simple COMA paper.)
- Simulating the Data Diffusion Machine by Erik Hagersten, Mats Grindal, Anders Landin, Ashley Saulsbury, Bengt Werner, and Seif Haridi. In Proceedings of the Parallel Architecture and Languages Europe (PARLE=EUROPAR), Springer-Verlag, June 1993. (Best Presentation Award.)
- Toward Scalable Cache Only Memory Architectures by Erik Hagersten. PhD thesis, Department of Telecommunications and Computer Systems, The Royal Institute of Technology, Stockholm, Sweden, October 1992.
- Race-free Interconnection Networks and Multiprocessor Consistency by Anders Landin, Erik Hagersten, and Seif Haridi. In Proceedings of the 18th International Symposium on Computer Architecture (ISCA), vol. 19, no. 3, pages 106-115, Toronto, Canada, May 1991.