Computer Architecture, First Course
Pipelines: Simulation of processor pipeline
In this laboratory work (lab) we will learn how to execute code efficiently on a modern RISC-processor using pipelining.
All assembler code in this lab is going to be written in DLX (pronounced: deluxe). DLX is not a real architecture, i.e., there is no DLX-processor available. The architecture is only developed for teaching about RISC-processors. The architecture is described in detail in the book Computer Architecture - A quantitative approach by Hennessy and Patterson. You will only need to learn a few instructions so there is no need to buy the book. DLX is very similar to the SPARC, MIPS and PowerPC architectures.
-
1. Lab-software: WinDLX
-
2. Pipelining
-
3. Exercise 1 - Integer Execution
-
4. Exercise 2 - Floating Point Execution
-
5. Exercise 3 - Optimization of Assembler Code
-
6. Exercise 4 - Branch prediction
-
7. Presentation
1. Lab-software: WinDLX
The lab is carried out on the program WinDLX, which is installed on PC-lab computers (if you can not find the program on your machine, just follow instructions below and download a new copy). The program is developed by Technicsche Universität Wien. WinDLX is also available for download to any Windows-based PC (two files are needed: windlx.exe and windlx.hlp). A PDF-version of the WinDLX tutorial (wdlxtut.doc) is here.
- Create an own temporary directory by clicking on the right mouse button on the background and choosing "new". Sometimes problems can occur if you put the directory in another position in the file tree.
- The files ex1.s, ex2.s, ex3.s and ex4.s are DLX assembler example files which are used in the lab.
- Be aware of the file suffix. Sometimes Windows adds the suffix .txt to downloaded files. Use the program Notepad (Anteckningar) to edit the files and make sure the suffix is .s. WinDLX normally only finds files with the suffix .s.
- When you have finished the lab, you should delete the temporary directory you have added.
2. Pipelining
A RISC-processor has several independent parts, e.g., a load/store unit, a multiplication unit and an addition unit. They can perform their respective tasks concurrently. Therefore, during the same clock cycle, an addition, a multiplication and a load from the memory can be handled.
Each assembler instruction can be divided in several smaller substeps.
- IF - instruction fetch: Load the instruction from the memory to an instruction register.
- ID - instruction decode: Decode the instruction and read the registers the instruction is fetching data from.
- EX - execute: Execute the operation, that is, let the ALU in the processor perform its task.
- MEM - memory access/branch completion cycle: Only for loads, stores and branches. Read or write from memory and perform possible jumps.
- WB - writeback: Write the result back to the register.
Each substep can only be performed if data are available when the operation has finished its previous task. If this is not the case, the execution has to wait until data are available or the next unit is free. This is called a pipeline stall.
Gurphur M. Prabhu has written a more thorough description of pipelines, caches and the DLX architecture. For more information click here!
3. Exercise 1 - Integer Execution
1. Open the file ex1.s in Notepad. What does the program do?
Some explanations to the DLX assembler instructions:
Instruction | Description | Syntax |
---|---|---|
lw | load word from memory | lw register1, address1 |
add | add two integers | add register1, register2, register3 |
sw | store word to memory | sw address1, register1 |
trap | exit program | trap 0 |
2. Load the file ex1.s in WinDLX: Choose File => Load Code or Data => choose ex1.s => select => Load
3. Open the windows Code, Register and Clock cycle diagram so that you can see them all. The code can be run by choosing Run (F5) or Single cycle (F7) in the execute-menu.
4. Step through the program and answer the following questions:
- During what cycle does register r1 contain the value of n1?
- During what cycle does register r2 contain the value of n2?
- When can the ALU perform the addition? Are there any stalls? What kind of stalls?
- How many clock cycles does it take to run the program? If there were no stalls, how many cycles would the execution take?
If you don't want to count the number of clock cycles by hand, you can reorder the numbering by choosing Absolute Cycle Count in the Configuration menu.
4. Exercise 2 - Floating Point Execution
This exercise is very similar to the previous exercise. The difference is that in this exercise floating numbers are used instead of integers. We are using the data type float (size 4 bytes), e.g., single precision. Floats are stored in the floating point registers f0-f31.
Floating operations are normally more complex than integer operations. This makes floating operations take longer time to execute. In WinDLX it is possible to configure the number of units in the processor for floating point operations and also the execution time for each floating operation. As a default, a floating addition takes 2 cycles, a floating multiplication 5 cycles and a floating division 19 cycles.
1. Reset DLX by choosing File => Reset All
2. Load the file ex2.s in WinDLX: Choose File => Load Code or Data => choose ex2.s => select => Load
3. What is the difference between ex1.s and ex2.s?
Some new DLX-instructions:
Instruction | Description | Syntax |
---|---|---|
lf | load float from memory | lf register1, address1 |
cvti2f | convert integer to float | cvti2f register1, register2 |
multf | multiply floats | multf register1, register2, register3 |
sf | store float | sf address1, register1 |
4. Exercise: Step through the program and answer the following questions:
- During what cycle does register f5 contain the value of n1?
- During what cycle does register f6 contain the value of n2?
- When can the ALU perform the multiplication? Are there any stalls? What kind of stalls?
- How many clock cycles does it take to run the program? If there were no stalls, how many cycles would the execution take?
5. Exercise 3 - Optimization of Assembler Code
In this exercise you should start with ex3.s. In the program a computation with four variables p, q, r and s are carried out. The expression 2p+2q+2r+s*r+s/q+r/q is to be calculated.
Instruction | Description | Syntax |
---|---|---|
addf | add floats | addf register1, register2, register3 |
divf | divide floats | divf register1, register2, register3 |
The exercise is to change the order of computations so that it becomes more efficient, that is fewer cycles are needed for the computations. You are allowed to perform more or less additions, multiplications or divisions than in the original code. The only limitation is that the same result should be available in the f1 register after the program execution. The first 25 lines of code should not be changed. They contain code which set p, q, r and s in register f5-f8. It should be possible to change the values of p, q, r and s between the runs. The set-up of the variables takes 17 cycles before the "useful" computations are begun.
The execution takes 77 cycles in the original code. To pass the exercise you have to perform the computations in less or equal than 44 cycles. The number of cycles should be counted until the message "Trap #0 occurred" is displayed. The person who manages best with eliminating the number of necessary cycles might win a small prize... You are allowed to use all registers for the computations.
Questions:
- What should you think about when programming a modern pipelined processor?
- What should you think about when it comes to floating points divisions?
6. Exercise 4 - Branch prediction
Modern processors have branch prediction schemes which try to predict whether branches are taken or not in assembler code.
Instruction | Description | Syntax |
---|---|---|
bnez | branch not equal to zero | bnez register1, label |
DLX uses a very simple branch-not-taken prediction. Download ex4.s and run it through WinDLX. It contains a very simple loop which is run through four times.
Questions:
- What happens when the branch is not taken?
- How many cycles does the execution take?
- If we could change the DLX hardware predictor to branch-always-taken prediction, how many cycles would the execution take?
- Modern processors often have advanced dynamic branch predictors. Give an example of how such branch predictors work?
- Why do correct branch predictions become more important every year in pipelined processors?
7. Presentation
The presentation will be oral. However, it is important that you bring printouts of all code and answers you have written to the presentation.