









· In this example, the task is decomposed into 3 operations: Reading,

Computer Architecture

Example (cont'd):

multiplication, and addition.







# Computer Architecture Speedup: k: number of stages in the pipeline t<sub>p</sub>: cycle time n: number of tasks $t_{\mbox{\scriptsize n}}$ : time required for a task without pipelining Calculation of the total time required for n tasks: • k cycles required to complete the first task (T1). Time: $T(1) = k \cdot t_p$ remaining n-1 tasks require (n-1) cycles. Total time required for n tasks: $\frac{+ \text{ Time: } (n-1)t_p}{T(n) = (k+n-1)tp}$ Execution time without the pipeline $S = \frac{n \cdot t_n}{(k+n-1) \cdot t_p}$ If the number of tasks increases significantly : n $\rightarrow$ $\infty$ , $\sum_{\substack{n \\ \text{limits}}} = \frac{t_n}{t_n}$ If we assume $t_n=k\cdot t_p$ , (If it were possible to divide the main task into k equal small operations and ignore the register delays, the cycle time would be $t_p = t_n / k$ .) $S_{max} = k$ (Theoretical maximum speedup)







# License: https://creativecommons.org/licenses/by-nc-nd/4.0/ 2.4 Instruction Pipeline (Instruction-Level Parallelism) During the execution of each instruction the CPU repeats some operations. The processing required for a single instruction is called an instruction cycle. An instruction cycle is generally composed of these stages: instruction fetch and decoding, operand fetch, execution, interrupt. (See the figure on 1.18) The simplest instruction pipeline can be constructed with two stages: 1) Fetch and decode instruction 2) Fetch operands and execute instruction When the main memory is not being accessed during the execution of an instruction, this time can be used to fetch the next instruction in parallel with the execution of the current one. Example: Cycle: 1 2 Instr. 1 Fetch, decode Operand, exec. Fetch decode 3 Instr. 2 Fetch, decode Operand, exec. Fetch, decode Operand, exec. The potential overlap among instructions is called instruction-level parallelism. Remember: To gain more speedup, the pipeline must have more stages with short http://akademi.itu.edu.tr/en/buzluca © ① ⑤ © 2013 - 2021 Feza BUZLUCA 2.13



The instruction cycle can be decomposed into 6 operations to gain more speedup:

Decode instruction (DI): Determine the opcode and the operand specifiers.

1. **Fetch instruction (FI)**: Read the next expected instruction into a buffer.

Calculate addresses of operands (CO): Calculate the effective address. Fetch operands (FO): Fetch each operand from memory.

Execute instruction (EI): Perform the indicated operation.

6. Write operand (WO): Store the result in memory.

Instruction Pipeline (cont'd)

# 1. FI (Fetch Instruction): Read the next instruction the PC points to into a buffer. 2. DA (Decode, Address): Decode instruction, calculate operand addresses 3. FO (Fetch Operand): Read operands (memory/register) 4. EX (Execution): Perform the operation and update the registers (including the PC in branch/jump instructions) 1. In order to perform instruction and operand fetch operations at the same time, we assume that the processor has separate instruction and data memories. 4. Memory-write operations are ignored in these examples. 5. This an exemplary pipelined CPU. More realistic examples are given in section "2.4.2 An Exemplary RISC Processor with Pipelining".

2.4.1 An (exemplary) instruction pipeline (with 4 stages)

Computer Architecture





| Computer Architecture                              |                                                                                                                                                                            |  |  |  |  |
|----------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|
| 2.4.1 An (exemplary) instruction pipeline (cont'd) |                                                                                                                                                                            |  |  |  |  |
| B.2 Control Hazards (Branches, Interrupts):        |                                                                                                                                                                            |  |  |  |  |
|                                                    | cesses instructions in parallel, during the processing of a<br>the next instruction in the memory that should be actually<br>the pipeline.                                 |  |  |  |  |
|                                                    | chanism is necessary; otherwise, the instruction(s) that should ng to the program will also be executed.                                                                   |  |  |  |  |
| Example:                                           |                                                                                                                                                                            |  |  |  |  |
| <ol> <li>Instruction</li> </ol>                    |                                                                                                                                                                            |  |  |  |  |
| <ol> <li>JUMP T</li> <li>Instructi</li> </ol>      |                                                                                                                                                                            |  |  |  |  |
| 4. Target Instruction                              | n_4 < Target of the branch (target instruction)                                                                                                                            |  |  |  |  |
| is also fetched int<br>To prevent the pro          | ing of the unconditional branch instruction JUMP, Instruction_3 the pipeline. gram from running incorrectly, the pipeline must be stopped efore Instruction_3 is executed. |  |  |  |  |
| http://akademi.itu.edu.tr/en/bu                    | duca/ (C) (S) 2013 - 2021 Feza BUZLUCA 2.18                                                                                                                                |  |  |  |  |



































| Computer Architecture                                                                                                                                                                                                |  |  |  |  |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|
| 2.5 Pipeline Hazards (Conflicts) and Solutions                                                                                                                                                                       |  |  |  |  |
| There are three types of hazards                                                                                                                                                                                     |  |  |  |  |
| 1. Resource Conflict (Structural hazard):                                                                                                                                                                            |  |  |  |  |
| A resource hazard occurs when two (or more) instructions that are already in the pipeline need the same resource (memory, functional unit).                                                                          |  |  |  |  |
| 2. Data Conflict (Hazard)                                                                                                                                                                                            |  |  |  |  |
| Data hazards occur when data is used before it is ready.                                                                                                                                                             |  |  |  |  |
| 3. Control Hazards (Branch, Jump, Interrupt):                                                                                                                                                                        |  |  |  |  |
| During the processing of a branch instruction, the next instruction in the memory that should actually be skipped also enters the pipeline.                                                                          |  |  |  |  |
| Which <b>target instruction</b> should be fetched into the pipeline is unknown, unless the CPU executes the branch instruction (updating the PC).                                                                    |  |  |  |  |
| Conditional branch problem: Until the actual execution of the instruction that alters the flag values, the flag values are unknown (greater?, equal?), so it is impossible to determine if the branch will be taken. |  |  |  |  |
| Stalling solves all these conflicts but it reduces the performance of the system.                                                                                                                                    |  |  |  |  |
| There are more efficient solutions.                                                                                                                                                                                  |  |  |  |  |
| http://akademi.itu.edu.tr/en/buzluca/<br>http://www.buzluca.info 2013 - 2021 Feza BUZLUCA 2.36                                                                                                                       |  |  |  |  |



© **(3)** © 2013 - 2021 Feza BUZLUCA 2.37

http://akademi.itu.edu.tr/en/buzluca

















































# Important points about changing the order of the instructions:

- An instruction from before the branch can be placed immediately after the branch.
- The branch (condition or address ) must not depend on the moved instruction.
- This method (if possible) always improves the performance (compared to inserting NOOP).
- Especially for conditional branches, this procedure must be applied carefully. If the condition that is tested for the branch is altered by the immediately preceding instruction, then the complier cannot move this instruction to after the branch.

# Other possibilities:

The compiler can select instructions to move

- · From branch target
- Must be OK to execute moved instruction even if the branch is not taken
- Improves performance when branch is taken
- From fall through (else)
- Must be OK to execute moved instruction even if the branch is taken
- Improves performance when branch is not taken

http://akademi.itu.edu.tr/en/buzluca/

@ ① ③ ② 2013 - 2021 Feza BUZLUCA

# Solutions to Control (Branch) Hazards (cont'd):

## D) Branch Prediction:

Remember: The existence of branch/jump instructions in the program causes two main problems:

The CPU does not know the target instruction to fetch into the pipeline until it calculates the target address of the branch instruction.

PC ← PC + offset

Later stages of the pipeline (not IF stage) carry out this calculation. Options:

- a) If address calculation is in EX and result is sent from EX/ME register to IF stage (slide 2.32), branch penalty = 3 cycles.
- b) If address calculation is in EX and result is directly sent to IF stage (slide 2.53), branch penalty = 2 cycles.
- c) If address calculation is in DR and result is directly sent to IF stage (slide 2.55), branch penalty = 1 cycle (valid for unconditional branch/jump

To solve this problem, a branch target table (slide 2.66) is used to determine the target address in advance.

The branch target table is a cache memory in the IF stage that keeps the addresses of the branch instructions and their target addresses.

http://akademi.itu.edu.tr/en/buzluca/

@ ⊕ ⊕ 2013 - 2021 Feza BUZLUCA

# Computer Architecture

Branch/jump instructions in the program cause two main problems (cont'd):

2. Conditional branch problem: Until the previous instruction is actually executed, it is impossible to determine whether the branch will be taken or not because the values of the flags are not known.

> If the branch is not taken,  $PC \leftarrow PC + 4$  (1 instruction = 4 bytes for the exemplary RISC processor )

PC ← PC + offset If the branch is taken,

- a) If the branch decision logic is in ME stage (after EX) (slide 2.32), branch penalty = 3 cycles.
- b) If the branch decision logic is in EX (slide 2.53), branch penalty = 2 cycles. To solve this problem, prediction mechanisms are used

When a conditional branch is recognized, a branch prediction mechanism predicts whether the branch will be taken or not.

According to the prediction, either the next instruction in the memory or the target instruction of the branch is prefetched.

If the prediction is correct, there is no branch penalty.

In case of misprediction, the pipeline must be stopped and emptied.

http://akademi.itu.edu.tr/en/buzluca/

@ ① ⑤ ② 2013 - 2021 Feza BUZLUCA

Computer Architecture

# D) Branch Prediction (cont'd):

There are two types of branch prediction mechanisms: static and dynamic.

# Static branch prediction strategies:

- a) Always <u>predict not taken</u>: Always assumes that the branch will not be taken and fetches the next instruction in sequence.
- b) Always <u>predict taken</u>: Always predicts that the branch will be taken and fetches the target instruction of the branch.

To determine the target of the branch in advance (without calculation), the ranch target table is used (slide 2.66).

Studies analyzing program behavior have shown that conditional branches are taken more than 50% of the time.

Therefore, always prefetching from the branch target address should give better performance than always prefetching from the sequential path.

http://akademi.itu.edu.tr/en/buzluca/

@ ① ⑤ ② 2013 - 2021 Feza BUZLUCA

Computer Architecture

# Branch target table (BTT): Target Instruction prefetch

"Always predict taken" strategy: Always fetches target instruction of the branch. However, the CPU does  ${f not}$  know the target instruction to fetch into the pipeline until it calculates the target address of the branch instruction.

To determine the target of the branch in advance, the branch target table (BTT) is used.

In the branch target table, addresses of recent branch instructions and their target addresses (where they jump) are kept in a cache memory (Chapter 6). The BTT makes it possible for the target instruction to be prefetched in the stage (IF) without calculating the branch target address.

There is a separate row for each branch instruction that has recently run. The number of recent branch instructions stored is limited by the size of the table When a branch instruction runs for the first time, its target address is calculated and written into the BTT

|                                | Branch instruction add | lr. Target address | Example:            |
|--------------------------------|------------------------|--------------------|---------------------|
| One row for                    | \$A000                 | \$B000             |                     |
| each branch                    |                        |                    | \$A000 BGT Target   |
| instruction that 🍈             |                        |                    |                     |
| has recently run.              |                        |                    | \$B000 Target ADD   |
| http://akademi.itu.edu.tr/en/b | uzluca/                | @ 08e              | 2004 5 011711104 24 |

License: https://creativecommons.org/licenses/bv-nc-nd/4.0/

## D) Branch Prediction (cont'd):

## Dynamic branch prediction strategies:

Dynamic branch strategies record the history of all conditional branch instructions in the active program to predict whether the condition will be true or not

One or more prediction bits (or counters) are associated with each conditional branch instruction in a program that reflect the recent history of the

These prediction bits are kept in a branch history table - BHT (slide 2.69) and they provide information about the branch history of the instruction (branch was taken or not in previous runs).

http://akademi.itu.edu.tr/en/buzluca/

@ 000 2013 - 2021 Feza BUZLUCA

# 1-bit dynamic prediction scheme:

For each conditional branch instruction (i), a single individual prediction bit  $(p_i)$  is stored in the branch history table (BHT).

The prediction bit  $\mathbf{p}_i$  records only whether the last execution of this instruction (i) resulted in a branch or not

If the branch was taken last time, the system predicts that the branch will be taken next time.

# Algorithm:

Fetch the i<sup>th</sup> conditional branch instruction

If (p<sub>i</sub> = 0) then predict **not to take** the branch, fetch the next instruction in sequence If  $(p_i = 1)$  then predict to take the branch, prefetch the target instruction of the branch If the branch is really taken, then  $p \leftarrow 1$ 

If the branch is not really taken, then  $p_i \leftarrow 0$ 

The initial value of  $\mathbf{p}_l$  is determined depending on the case in the first run of the conditional branch instruction.

In the first run, the target address is calculated and stored in the BHT.

During the calculation of the target address, next instructions in sequence (not the target of branch) are fetched into the pipeline. In case of a branch, there will be branch penalty.

http://akademi.itu.edu.tr/en/buzluca

@ ① ③ ② 2013 - 2021 Feza BUZLUCA

# Computer Architecture

# Branch target buffer and branch history table (BHT):

Prediction bits are kept in a high-speed memory location called the branch history table (BHT).

For each recent branch instruction in the current program, the BHT stores the address of the instruction, the target address, and the state (prediction) bits.

Each time a conditional branch instruction is executed the associated prediction bits are updated according to whether the branch is taken or not.

These prediction bits direct the pipeline control unit to make the decision the next time the branch instruction is encountered.

If the prediction is that "the branch will be taken", with the help of the target buffer, the target instruction of the branch can be prefetched without calculating the branch address. State

(prediction) bits Branch instruction addr Target address Recent conditional BHT: branch Branch history instructions in table program @ ① ⑤ ⑤ 2013 - 2021 Feza BUZLUCA http://akademi.itu.edu.tr/en/buzluca/

## Computer Architecture

Example: 1-bit dynamic prediction scheme and loops:

Prediction mechanisms are advantageous if there are loops in the program.

## Example:

counter  $\leftarrow 100$ : register or memory location LOOP ; instructions in the loop

Decrement counter ; counter  $\leftarrow$  counter - 1

BNZ LOOP ; Branch if not zero (conditional branch, it has a p bit) ; Next instruction after the loop

A) We assume that in the beginning of the given piece of code, the BNZ instruction is in the BHT and the value of its p bit is 1 (predict to take the branch).

In the first iteration (step) of the loop, the prediction at BNZ will be correct and the pipeline will prefetch the correct instruction (beginning of the loop).

The p bit (p=1) is not changed until the last iteration of the loop.

In the last iteration of the loop, the p bit is still 1, and the prediction is to take th branch; however, as the counter is zero, the program will not jump, and it will instead continue with the next instruction following the branch (misprediction).

The p bit of BNZ is cleared (p  $\leftarrow$  0) because the branch is not taken in the last step As a result, in a loop with 100 iterations, there are 99 correct predictions and only one incorrect prediction.

http://akademi.itu.edu.tr/en/buzluca

@ 0 0 0 2013 - 2021 Feza BUZLUCA

# Computer Architecture

Example: 1-bit dynamic prediction scheme and loops (cont'd):

B) If in the beginning of the given piece of code, the BNZ instruction is not in the BHT, the system cannot make a prediction in the first run. After the calculation of the target address of the BNZ, the related information is

written into the BHT. During the calculation of the target address, next instructions in sequence (not

the target of branch) are fetched into the pipeline.

In the first run, the branch is taken, and the program jumps to the beginning of the loop, so there will be a branch penalty.

The initial value of **p** becomes 1 (predict that the branch will be taken). The value of p(p = 1) does not change until the last iteration (step) of the loop.

In the last iteration of the loop, the p bit is still 1, and the prediction is that the branch will be taken; however, as the counter is zero, the program will not jump, and it will instead continue with the next instruction following the branch (misprediction).

The p bit of BNZ is cleared (p  $\leftarrow$  0).

As a result, in a loop with 100 iterations, in the first iteration, a prediction cannot be made. Then, there are 98 correct predictions and one incorrect prediction. In total, there are 2 branch penalties,

http://akademi.itu.edu.tr/en/buzluca/

@ ① ⑤ ② 2013 - 2021 Feza BUZLUCA

Computer Architecture

# Problem with the 1-bit dynamic prediction scheme:

(Nested loops: the same loop is executed many times)

In nested loops, a one-bit prediction scheme will cause two

· one in the first iteration, and

one on exiting

> LOOP BNZ LOOP BNZ LOOP EX

→ LOOP EX

Remember: in the previous example, after exiting the loop, the p bit of the inner BNZ LOOP was 0 ("don't take the branch") (p=0) .

Now, if the same loop runs again (2nd run), in the first iteration (step), the prediction about the BNZ will be "not to take the branch" (p=0).

However, the program will jump to the beginning of the loop (first misprediction). Now, the p bit will be 1 because branch is taken (p  $\leftarrow$  1).

Until the last iteration of the loop, predictions will be correct.

In the last iteration of the loop, there will be a misprediction as in the previous example (second misprediction).

Hence, misprediction will occur twice for each full iteration of the inner loop.

http://akademi.itu.edu.tr/en/buzluca/

@ 0 S = 2013 - 2021 Feza BUZLUCA









|                                        |              | a. Static pre                                                                                                                   | diction       |  |  |
|----------------------------------------|--------------|---------------------------------------------------------------------------------------------------------------------------------|---------------|--|--|
| i)                                     | Always pred  | predict not taken (For this method, a BTT (branch target table) is cessary)                                                     |               |  |  |
|                                        | BNZ LOOP1:   | P1: There is a correct prediction only in the last iteration (exit).  Other predictions are incorrect.  Correct: 1 Incorrect: 9 |               |  |  |
|                                        | BNZ LOOP2:   | : There is a correct prediction only in the last iteration (exit).  Other predictions are incorrect.  Correct : 10x1 = 10       |               |  |  |
|                                        | Total:       | Correct : 11                                                                                                                    | Incorrect: 99 |  |  |
| This method is not suitable for loops. |              |                                                                                                                                 |               |  |  |
|                                        | I his method | is not suitable for loops                                                                                                       | i.            |  |  |

|                    | a. Static prediction (cont                                                                                                      | '4/                              |
|--------------------|---------------------------------------------------------------------------------------------------------------------------------|----------------------------------|
|                    | •                                                                                                                               | •                                |
| ii-1) Always predi | ct taken under the assumption                                                                                                   | that instructions are in the BTT |
| BNZ LOOP1:         | There is a misprediction only in Other predictions are correct.                                                                 |                                  |
|                    | Correct: 9                                                                                                                      | Incorrect: 1                     |
| BNZ LOOP2:         | There is a misprediction only in Other predictions are correct.                                                                 |                                  |
|                    | Correct : 10x9 = 90                                                                                                             | Incorrect : $10 \times 1 = 10$   |
| Total:             | Correct: 99                                                                                                                     | Incorrect: 11                    |
| BNZ LOOP1:         | ict taken under the assumption<br>There are mispredictions only in<br>Other predictions are correct.<br>Correct: 8              |                                  |
| f<br>I             | In the first run of the loop, the<br>irst and last iterations; other p<br>in the 2nd -10th runs, there is a<br>teration (exit). |                                  |
| it                 | Correct : 8+9x9 = 89                                                                                                            | Incorrect : $2+9\times1 = 11$    |

Computer Architecture License: https://creativecommons.org/licenses/bv-nc-nd/4.0/ Solution (cont'd): b. Dynamic prediction with one bit Note: Different prediction bits are used for each branch instruction (Slides 2.68, 2.69). i) Assumption: In the beginning, instructions are in the BHT, and initial decision is to take the branch BNZ LOOP1: There is a misprediction only in the last iteration (exit). Other predictions are correct. Correct: 9 Incorrect: 1  $\ensuremath{\mathsf{BNZ}}\xspace\,\mathsf{LOOP2}\!:$  In the first run of the loop, there is a misprediction only in the last iteration (exit). Other predictions are correct. After the first run, the prediction bit "p" changes to "branch will not be taken" Therefore, in the 2nd-10th runs, there are mispredictions in both the first and last iterations (Slide 2.71). Correct: 9 + 9x8 = 81 Incorrect: 1+ 9x2 =19 Correct: 90 Incorrect: 20 Total: http://akademi.itu.edu.tr/en/buzluca © ⊕ ⊕ 2013 - 2021 Feza BUZLUCA 2.79

Computer Architecture b. Dynamic prediction with one bit (cont'd): ii) In the beginning instructions are NOT in the BHT, or the initial decision is NOT  $\,$ to take the branch BNZ LOOP1: There are mispredictions in the first and last iterations. Other predictions are correct. Correct: 8 Incorrect: 2 BNZ LOOP2: There are mispredictions in the first and last iterations. Other predictions are correct. Correct: 10x8 = 80 Incorrect: 10x2 = 20 Correct: 88 Incorrect: 22 Total: http://akademi.itu.edu.tr/en/buzluca © © © 2013 - 2021 Feza BUZLUCA

c. Dynamic prediction with two bits:

i) Assumption: In the beginning, instructions are in the BHT, and the initial decision is to take the branch, prediction bits are 11.

BNZ LOOP1: There is a misprediction only in the last iteration (exit). Other predictions are correct.

Correct: 9 Incorrect: 1

BNZ LOOP2: There is a misprediction only in the last iteration (exit). Other predictions are correct.

Correct: 10x9 = 90 Incorrect: 10x1 = 10

Total: Correct: 99 Incorrect: 11

Computer Architecture c. Dynamic prediction with two bits (cont'd):  $\it ii)$  In the beginning, instructions are NOT in the BHT In the first run of the BNZ instructions, since the target address is unknown, next instructions in sequence (not the target of the branch) are fetched into the pipeline. Hence, there is a misprediction in the first iteration. After the CPU has decided to branch and the target address has been calculated, information about the BNZ is stored in the BHT, and prediction bits are set to 11.  $\ensuremath{\mathsf{BNZ}}\xspace\,\mathsf{LOOP1}\xspace$  There are mispredictions in the first and last iterations. Correct: 8 Incorrect: 2 BNZ LOOP2: In the first run, there are mispredictions in the first and last iterations. After the first run the decision is still "branch will be taken". Therefore, in the 2nd - 10th runs, there will be a misprediction only in the last iteration. Correct: 8 + 9x9 = 89Incorrect: 2 + 9x1 = 11Correct: 97 Incorrect: 13 http://akademi.itu.edu.tr/en/buzluca/ © 000 2013 - 2021 Feza BUZLUCA 2.82