# Pipeline

August 24, 2022 · about 15 minutes

## Abstract

1. Data path implications, hazards, and examining the performance of pipelines.
2. The interaction between pipelining and various aspects of instruction set design.
3. Etc.

### What Is a Pipeline?

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.

In a computer pipeline, each step in the pipeline completes a part of an instruction.

Like the assembly line, different steps are completing different parts of different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe—instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.

The time required between moving an instruction one step down the pipeline is a processor cycle.

Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage.

In a computer, this processor cycle is almost always 1 clock cycle.

$processor\ cycle = 1\ clock\ cycle$

Pipelining yields a reduction in the average execution time per instruction. If the starting point is a processor that takes multiple clock cycles per instruction, then pipelining reduces the CPI.
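
These two observations (cycle time set by the slowest stage, CPI reduced toward 1) can be illustrated with a back-of-the-envelope calculation; the stage latencies below are made-up numbers, not measurements of any real processor:

```python
# Illustrative stage latencies in picoseconds (hypothetical values).
stage_ps = {"IF": 200, "ID": 150, "EX": 200, "MEM": 250, "WB": 100}

# The clock cycle is set by the slowest pipe stage.
cycle_ps = max(stage_ps.values())          # 250 ps

n_instructions = 1000

# Unpipelined: each instruction takes the sum of all stage latencies.
unpipelined_ps = n_instructions * sum(stage_ps.values())

# Pipelined (ideal, no hazards): after the pipeline fills, one
# instruction completes per cycle.
pipelined_ps = (n_instructions + len(stage_ps) - 1) * cycle_ps

speedup = unpipelined_ps / pipelined_ps
print(f"cycle = {cycle_ps} ps, speedup = {speedup:.2f}x")
```

The ideal speedup approaches the pipeline depth but never reaches it, both because the cycle time is set by the slowest stage and because of the fill/drain cycles.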

## RISC V Instruction Set

🟢🟢 What is the relationship between RISC V and ARM?

All RISC architectures are characterized by a few key properties:

• All operations on data apply to data in registers and typically change the entire register (32 or 64 bits per register).
• The only operations that affect memory are load and store operations that move data from memory to a register or to memory from a register, respectively. Load and store operations that load or store less than a full register (e.g., a byte, 16 bits, or 32 bits) are often available.
• The instruction formats are few in number, with all instructions typically being one size. In RISC V, the register specifiers rs1, rs2, and rd are always in the same place, simplifying the control.

These simple properties lead to dramatic simplifications in the implementation of pipelining, which is why these instruction sets were designed this way.

• All operations on data apply to data in registers and typically change the entire register.

❌❌❌ Don't the first and second properties seem a bit contradictory? (They are compatible: loads and stores only move data between memory and registers, while all actual operations on data happen in registers.)

• The instruction formats are few in number, and all instructions are typically one size.

### The Classic Five-Stage Pipeline for a RISC Processor

• IF: instruction fetch
• ID: instruction decode
• EX: execution
• MEM: memory access
• WB: write-back

To start with, we have to determine what happens on every clock cycle of the processor and make sure we don’t try to perform two different operations with the same data path resource on the same clock cycle.

For example, a single ALU cannot be asked to compute an effective address and perform a subtract operation at the same time. Thus, we must ensure that the overlap of instructions in the pipeline cannot cause such a conflict.

⭐⭐ @todo The figure in C-7 illustrates this problem well; it could be added here later.

First, we use separate instruction and data memories, which we would typically implement with separate instruction and data caches.

The use of separate caches eliminates a conflict for a single memory that would arise between instruction fetch and data memory access.

Notice that if our pipelined processor has a clock cycle that is equal to that of the unpipelined version, the memory system must deliver five times the bandwidth. This increased demand is one cost of higher performance.

Second, the register file is used in the two stages: one for reading in ID and one for writing in WB. These uses are distinct, so we simply show the register file in two places. Hence, we need to perform two reads and one write every clock cycle.

To handle reads and a write to the same register (and for another reason, which will become obvious shortly), we perform the register write in the first half of the clock cycle and the read in the second half.
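
The half-cycle timing can be mimicked in a sketch where the write is applied before the reads of the same simulated cycle (the function and parameter names are invented for illustration):

```python
def regfile_cycle(regs, write_port, read_ports):
    """One clock cycle of the register file (sketch).  The write happens
    in the first half of the cycle, so reads in the second half of the
    same cycle see the newly written value."""
    rd, value = write_port
    if rd != 0:                # x0 is hardwired to zero in RISC V
        regs[rd] = value       # first half: write
    return [regs[r] for r in read_ports]   # second half: reads

regs = [0] * 32
regs[5] = 1
# WB writes x5 = 99 while ID reads x5 in the same cycle:
a, b = regfile_cycle(regs, write_port=(5, 99), read_ports=[5, 5])
print(a, b)   # 99 99
```

This ordering is what lets an instruction in ID read the value that an instruction in WB is writing during the same clock cycle, without extra hardware.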

The register file is used by two stages: it is read in ID and written in WB, and these two uses are distinct.

❌❌ The italicized part is not understood.

Third, Figure C.2 does not deal with the PC. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. Furthermore, we must also have an adder to compute the potential branch target address during ID. One further problem is that we need the ALU in the ALU stage to evaluate the branch condition. Actually, we don’t really need a full ALU to evaluate the comparison between two registers, but we need enough of the function that it has to occur in this pipe stage.

Although it is critical to ensure that instructions in the pipeline do not attempt to use the hardware resources at the same time, we must also ensure that instructions in different stages of the pipeline do not interfere with one another.

This separation is done by introducing pipeline registers between successive stages of the pipeline, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle. Figure C.3 shows the pipeline drawn with these pipeline registers.

Although many figures will omit such registers for simplicity, they are required to make the pipeline operate properly and must be present. Of course, similar registers would be needed even in a multicycle data path that had no pipelining (because only values in registers are preserved across clock boundaries).

In the case of a pipelined processor, the pipeline registers also play the key role of carrying intermediate results from one stage to another where the source and destination may not be directly adjacent. For example, the register value to be stored during a store instruction is read during ID, but not actually used until MEM; it is passed through two pipeline registers to reach the data memory during the MEM stage. Likewise, the result of an ALU instruction is computed during EX, but not actually stored until WB; it arrives there by passing through two pipeline registers. It is sometimes useful to name the pipeline registers, and we follow the convention of naming them by the pipeline stages they connect, so the registers are called IF/ID, ID/EX, EX/MEM, and MEM/WB.
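
As a sketch, the pipeline registers can be modeled as small records latched between stages; here a store's B value, read in ID, is carried through ID/EX and EX/MEM until MEM finally uses it (the structure and names are illustrative, not a real implementation):

```python
# Pipeline registers modeled as dicts latched at each clock edge.
# Field names ("A", "B", "Imm", "ALUOutput") follow the text.
data_mem = {}

def store_through_pipeline(rs1_val, rs2_val, imm):
    """Walk one store instruction through ID -> EX -> MEM."""
    # End of ID: operands latched into the ID/EX register.
    id_ex = {"A": rs1_val, "B": rs2_val, "Imm": imm}
    # End of EX: effective address computed; B is just carried along.
    ex_mem = {"ALUOutput": id_ex["A"] + id_ex["Imm"], "B": id_ex["B"]}
    # MEM: the value read back in ID finally reaches the data memory.
    data_mem[ex_mem["ALUOutput"]] = ex_mem["B"]
    return ex_mem["ALUOutput"]

addr = store_through_pipeline(rs1_val=0x100, rs2_val=42, imm=8)
print(hex(addr), data_mem[addr])   # 0x108 42
```

The point of the sketch is that B is not used in EX at all, yet must be latched into ID/EX and EX/MEM so that it is still available two cycles later.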

## # The Major Hurdle of Pipelining—Pipeline Hazards

There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:

1. Structural hazards arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution. In modern processors, structural hazards occur primarily in special purpose functional units that are less frequently used (such as floating point divide or other complex long running instructions). They are not a major performance factor, assuming programmers and compiler writers are aware of the lower throughput of these instructions. Instead of spending more time on this infrequent case, we focus on the two other hazards that are much more frequent.
2. Data hazards arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
3. Control hazards arise from the pipelining of branches and other instructions that change the PC.

Hazards in pipelines can make it necessary to stall the pipeline. Avoiding a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. For the pipelines we discuss in this appendix, when an instruction is stalled, all instructions issued later than the stalled instruction—and hence not as far along in the pipeline—are also stalled. Instructions issued earlier than the stalled instruction—and hence farther along in the pipeline—must continue, because otherwise the hazard will never clear. As a result, no new instructions are fetched during the stall. We will see several examples of how pipeline stalls operate in this section—don’t worry, they aren’t as complex as they might sound!
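
The stall behavior described above can be sketched as a small occupancy chart (the specific instructions and the cause of the stall are hypothetical):

```python
# Stage occupancy per cycle (illustrative timing, hypothetical stall).
# i1 is stalled one extra cycle in ID; i2, issued later, is held in IF;
# i0, issued earlier, continues so that the hazard can clear.
timeline = {
    "i0": ["IF", "ID", "EX", "MEM", "WB"],
    "i1": ["",   "IF", "ID", "ID",  "EX", "MEM", "WB"],
    "i2": ["",   "",   "IF", "IF",  "ID", "EX",  "MEM", "WB"],
}

n_cycles = max(len(row) for row in timeline.values())
print("     " + "".join(f"c{c:<4}" for c in range(n_cycles)))
for name, stages in timeline.items():
    cells = stages + [""] * (n_cycles - len(stages))
    print(f"{name}: " + "".join(f"{s:<5}" for s in cells))
```

Note that no new instruction enters the pipeline during cycle 3: the one-cycle bubble inserted behind i0 delays i1, i2, and everything after them by one cycle.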

@todo

### Reducing the Cost of Branches Through Prediction

As pipelines get deeper and the potential penalty of branches increases, using delayed branches and similar schemes becomes insufficient.

Instead, we need to turn to more aggressive means for predicting branches. Such schemes fall into two classes: low-cost static schemes that rely on information available at compile time and strategies that predict branches dynamically based on program behavior. We discuss both approaches here.

🔴🔴🔴 @todo The significance of branch prediction needs further study

1. Low-cost static schemes that rely on information available at compile time
2. Dynamic branch prediction based on program behavior

@todo

#### Dynamic Branch Prediction and Branch-Prediction Buffers

The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table. A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not. This scheme is the simplest sort of buffer; it has no tags and is useful only to reduce the branch delay when it is longer than the time to compute the possible target PCs.

With such a buffer, we don’t know, in fact, if the prediction is correct—it may have been put there by another branch that has the same low-order address bits. But this doesn’t matter. The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back.
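
A minimal sketch of such a 1-bit buffer, assuming a table indexed by the low-order PC bits with no tags (the class and method names are invented for illustration):

```python
class OneBitPredictor:
    """Minimal 1-bit branch-prediction buffer (illustrative sketch).
    Indexed by the low-order bits of the branch PC; there are no tags,
    so two branches that alias to the same entry share (and may
    corrupt) each other's prediction bit."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)  # False = predict not taken

    def predict(self, pc):
        return self.table[pc & self.mask]

    def update(self, pc, taken):
        # If the hint turns out to be wrong, the bit is simply
        # overwritten with the actual outcome.
        self.table[pc & self.mask] = taken

bp = OneBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch: taken 9 times, then exits
correct = 0
for taken in outcomes:
    correct += (bp.predict(0x4000) == taken)
    bp.update(0x4000, taken)
print(f"{correct}/{len(outcomes)} correct")   # 8/10 correct
```

Even for this highly regular branch the 1-bit scheme mispredicts twice per loop visit in the steady state (once on exit and once on the next entry), which motivates the small improvement the text discusses next.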

❌❌❌ A question at this point: after a branch misprediction, is the pipeline flushed?

This buffer is effectively a cache where every access is a hit, and, as we will see, the performance of the buffer depends on both how often the prediction is for the branch of interest and how accurate the prediction is when it matches. Before we analyze the performance, it is useful to make a small, but important, improvement in the accuracy of the branch-prediction scheme.

The buffer's performance depends on how often the prediction concerns the branch of interest and how accurate the prediction is when it matches.

## How Is Pipelining Implemented?

Before we proceed to basic pipelining, we need to review a simple implementation of an unpipelined version of RISC V.

### A Simple Implementation of RISC V

In this subsection, we focus on a pipeline for an integer subset of RISC V that consists of load-store word, branch equal, and integer ALU operations.

Later in this appendix we will incorporate the basic floating-point operations. Although we discuss only a subset of RISC V, the basic principles can be extended to handle all the instructions; for example, adding store involves some additional computing of the immediate field. We initially used a less aggressive implementation of a branch instruction. We show how to implement the more aggressive version at the end of this section.

Every RISC V instruction can be implemented in, at most, 5 clock cycles. The 5 clock cycles are as follows:

1. Instruction fetch cycle (IF)

        IR <- Mem[PC];
        NPC <- PC + 4;

Operation—Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction. The IR is used to hold the instruction that will be needed on subsequent clock cycles; likewise, the register NPC is used to hold the next sequential PC.

2. Instruction decode/register fetch cycle (ID)

        A <- Regs[rs1];
        B <- Regs[rs2];
        Imm <- sign-extended immediate field of IR;

Operation—Decode the instruction and access the register file to read the registers (rs1 and rs2 are the register specifiers). The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The lower 16 bits of the IR are also sign extended and stored into the temporary register Imm, for use in the next cycle.

Decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the RISC V instruction format. Because the immediate portion of a load and an ALU immediate is located in an identical place in every RISC V instruction, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle. For stores, a separate sign-extension is needed, because the immediate field is split in two pieces.
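
Sign extension itself can be sketched as a small helper (the field width is left as a parameter, since it differs by instruction format):

```python
def sign_extend(value, bits):
    """Sign-extend a `bits`-wide two's-complement field to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

# Examples with a 12-bit field width:
print(sign_extend(0xFFF, 12))   # -1
print(sign_extend(0x7FF, 12))   # 2047
```

The subtraction trick works because a set sign bit contributes its (negative) weight exactly once, while the remaining bits keep their positive weights.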

❌❌ Further study needed on the immediate field.

3. Execution/effective address cycle (EX)

The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the RISC V instruction type:

• Memory reference: ALUOutput <- A + Imm; Operation—The ALU adds the operands to form the effective address and places the result into the register ALUOutput.
• Register-register ALU instruction: ALUOutput <- A func B; Operation—The ALU performs the operation specified by the function code (a combination of the func3 and func7 fields) on the value in register A and on the value in register B. The result is placed in the temporary register ALUOutput.
• Register-immediate ALU instruction: ALUOutput <- A op Imm; Operation—The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the temporary register ALUOutput.
• Branch: ALUOutput <- NPC + (Imm << 2); Cond <- (A == B); Operation—The ALU adds the NPC to the sign-extended immediate value in Imm, which is shifted left by 2 bits to create a word offset, to compute the address of the branch target. Register A, which was read in the prior cycle, is compared with register B to determine whether the branch is taken, because we consider only branch equal.

1. Memory reference: add the register value to the immediate to compute the effective address, which becomes the ALU output.
2. Register-register: the ALU applies the specified function code to the values in registers A and B; the result goes into the ALU output register.
3. Register-immediate: register A op the immediate.
4. Branch: the ALU adds NPC to the sign-extended immediate in Imm, shifted left by 2 bits to form a word offset, computing the branch target address. Register A, read in the prior cycle, is compared with register B to decide whether the branch is taken.

The load-store architecture of RISC V means that effective address and execution cycles can be combined into a single clock cycle, because no instruction needs to simultaneously calculate a data address, calculate an instruction target address, and perform an operation on the data. The other integer instructions not included herein are jumps of various forms, which are similar to branches.
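
The four EX-stage cases can be sketched as a single function following the register-transfer descriptions above (the `instr_type` tags and the signature are invented for illustration; this is not a decoder):

```python
import operator

def ex_stage(instr_type, A, B, Imm, NPC, op=None):
    """Sketch of the four EX-stage cases.  Returns (ALUOutput, Cond)."""
    if instr_type == "mem":            # effective address
        return A + Imm, None
    if instr_type == "reg_reg":        # A func B
        return op(A, B), None
    if instr_type == "reg_imm":        # A op Imm
        return op(A, Imm), None
    if instr_type == "branch":         # branch target and condition
        return NPC + (Imm << 2), A == B
    raise ValueError(instr_type)

addr, _ = ex_stage("mem", A=0x100, B=0, Imm=8, NPC=0)
total, _ = ex_stage("reg_reg", A=6, B=7, Imm=0, NPC=0, op=operator.add)
target, cond = ex_stage("branch", A=5, B=5, Imm=3, NPC=0x200)
print(hex(addr), total, hex(target), cond)   # 0x108 13 0x20c True
```

Because each instruction needs at most one of these computations, one ALU suffices in this stage; this is exactly the point made about the load-store architecture.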

4. Memory access/branch completion cycle (MEM)

The PC is updated for all instructions: PC <- NPC;

• Memory reference: LMD <- Mem[ALUOutput] or Mem[ALUOutput] <- B; Operation—Access memory if needed. If the instruction is a load, data return from memory and are placed in the LMD (load memory data) register; if it is a store, then the data from the B register are written into memory. In either case, the address used is the one computed during the prior cycle and stored in the register ALUOutput.
• Branch: if (cond) PC <- ALUOutput; Operation—If the instruction branches, the PC is replaced with the branch destination address in the register ALUOutput.

5. Write-back cycle (WB)

• Register-register or register-immediate ALU instruction: Regs[rd] <- ALUOutput;

• Load instruction: Regs[rd] <- LMD; Operation—Write the result into the register file, whether it comes from the memory system (which is in LMD) or from the ALU (which is in ALUOutput), with rd designating the register.
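
Putting the five cycles together, an unpipelined interpreter for this subset might look like the following sketch (the dict-based instruction encoding and helper names are invented for illustration; real hardware decodes bit fields from the IR):

```python
def execute(instr, pc, regs, mem):
    """Run one instruction through IF/ID/EX/MEM/WB (unpipelined sketch).
    `instr` stands in for the decoded IR.  Returns the next PC."""
    # IF: fetch (implicit here) and compute the next sequential PC.
    NPC = pc + 4
    # ID: read registers and sign-extended immediate into temporaries.
    A = regs[instr.get("rs1", 0)]
    B = regs[instr.get("rs2", 0)]
    Imm = instr.get("imm", 0)
    # EX: one of the four ALU cases.
    if instr["op"] == "add":
        ALUOutput = A + B
    elif instr["op"] in ("lw", "sw"):
        ALUOutput = A + Imm            # effective address
    elif instr["op"] == "beq":
        ALUOutput = NPC + (Imm << 2)   # branch target
        cond = (A == B)
    # MEM: memory access and PC update.
    pc = NPC
    if instr["op"] == "lw":
        LMD = mem[ALUOutput]
    elif instr["op"] == "sw":
        mem[ALUOutput] = B
    elif instr["op"] == "beq" and cond:
        pc = ALUOutput
    # WB: write the result back into the register file.
    if instr["op"] == "add":
        regs[instr["rd"]] = ALUOutput
    elif instr["op"] == "lw":
        regs[instr["rd"]] = LMD
    return pc

regs = [0] * 32
mem = {0x10C: 7}
regs[1] = 0x100
pc = execute({"op": "lw", "rd": 2, "rs1": 1, "imm": 0xC}, 0, regs, mem)
pc = execute({"op": "add", "rd": 3, "rs1": 2, "rs2": 2}, pc, regs, mem)
print(pc, regs[2], regs[3])   # 8 7 14
```

Each instruction here takes one call, i.e. up to five conceptual cycles end to end; pipelining overlaps these phases across instructions without changing what each phase does.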

🟢🟢 Note that the implementation above does not yet take pipelining into account.