# VP - HPCA 14

August 24, 2022 · about 31 minutes


## Abstract

Dedicating more silicon area to single thread performance will necessarily be considered as worthwhile in future – potentially heterogeneous – multicores.

In particular, Value prediction (VP) was proposed in the mid 90’s to enhance the performance of high-end uniprocessors by breaking true data dependencies.

💚💚 Pipeline parallelism vs. processor parallelism

• Pipelining: raises instruction-level parallelism; it focuses on instructions, hence "uniprocessor"
• Multiprocessing: raises processor-level parallelism

In this paper, we reconsider the concept of Value Prediction in the contemporary context and show its potential as a direction to improve current single thread performance.

First, building on top of research carried out during the previous decade on confidence estimation, we show that every value predictor is amenable to very high prediction accuracy using very simple hardware. This clears the path to an implementation of VP without a complex selective reissue mechanism to absorb mispredictions.

Prediction is performed in the in-order pipeline front-end and validation is performed in the in-order pipeline back-end, while the out-of-order engine is only marginally modified.

@todo 🔴🔴🔴

Second, when predicting back-to-back occurrences of the same instruction, previous context-based value predictors relying on local value history exhibit a complex critical loop that should ideally be implemented in a single cycle.

To bypass this requirement, we introduce a new value predictor VTAGE harnessing the global branch history. VTAGE can seamlessly predict back-to-back occurrences, allowing predictions to span over several cycles. It achieves higher performance than previously proposed context-based predictors.

🧡🧡 Some notes to self

1. How exactly does VTAGE leverage the global branch history?
2. "Span over several cycles": how can a prediction span several cycles?

## Introduction

Gabbay et al. and Lipasti et al. independently proposed Value Prediction to speculatively ignore true data dependencies and therefore shorten critical paths in computations.

VP can shorten critical paths in computations.

Said penalty can be as high as the cost of a branch misprediction, yet the benefit of an individual correct prediction is often very limited.

As a consequence, high coverage is mostly irrelevant in the presence of low accuracy.

The Forward Probabilistic Counters (FPC) scheme yields value misprediction rates well under 1%, at the cost of reasonably decreasing predictor coverage.

FPC keeps the misprediction rate well under 1%, at the cost of some predictor coverage.

Our experiments show that when FPC is used, no complex repair mechanism such as selective reissue is needed at execution time.

First, we present a simple yet efficient confidence estimation mechanism for value predictors.

All classical predictors are amenable to this level of accuracy.

FPC is very simple to implement and does not require substantial change in the counters update automaton. Prediction validation can even be delayed until commit time and be done in-order: Complex and power hungry logic needed for execution time validation is not required anymore. As a result, prediction is performed in the in-order pipeline front-end, validation is performed in the in-order pipeline back-end while the out-of-order execution engine is only marginally modified.

• Simple to implement; no substantial change to the counter update automaton is needed
• No complex repair mechanism such as selective reissue is required at execution time
• Validation can be delayed until commit time and done in order, which means the complex, power-hungry validation logic at execution time is no longer needed
• The out-of-order engine only needs marginal modification

FPC is a confidence estimation mechanism; its role is to lower the misprediction rate.

Second, we introduce the Value TAGE predictor (VTAGE). This predictor is directly derived from research propositions on branch predictors [21] and more precisely from the indirect branch predictor ITTAGE.

VTAGE is the first hardware value predictor to leverage a long global branch history and the path history. Like all other value predictors, VTAGE is amenable to very high accuracy thanks to the FPC scheme.

VTAGE is shown to outperform previously proposed context-based predictors such as Finite Context Method and complements stride-based predictors.

• VTAGE is a hardware value predictor that leverages a long global branch history and the path history
• Thanks to the FPC scheme, VTAGE achieves very high accuracy

Moreover, we point out that unlike two-level predictors (in particular, predictors based on local value histories), VTAGE can seamlessly predict back-to-back occurrences of instructions, that is, instructions inside tight loops. Practical implementations are then feasible.

❌❌❌ If prediction validation is delayed until commit time and done in-order, how do we guarantee the correctness of predictions in the meantime?

## Questions

🤷‍♂️🤷‍♂️🤷‍♂️ From the reading above, we need to find answers to the following questions in the paper:

1. What is the principle behind FPC's implementation?
2. How does VTAGE use the global branch history and the path history? In what sense is it context-based?
3. How does VTAGE solve the tight-loop problem?

Sazeides et al. refine the taxonomy of Value Prediction by categorizing predictors.

1. Computational
2. Context-based

1. The first level is the VHT (Value History Table)
2. The second level is the VPT (Value Prediction Table)

The VHT hashes into the VPT, and the VPT holds the actual predictions. Note that, in general, both the VHT and the VPT carry a saturating counter to estimate confidence.
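The two-level organization can be sketched as follows; the table sizes and the hash function are illustrative assumptions, not the paper's exact parameters:

```python
# Minimal sketch of an order-N FCM value predictor: a first-level local
# value history (VHT) hashed into a second-level prediction table (VPT).
class FCMPredictor:
    def __init__(self, order=4, vpt_entries=8192):
        self.order = order
        self.vpt_entries = vpt_entries
        self.vht = {}  # first level: pc -> last `order` values (local history)
        self.vpt = {}  # second level: hashed history -> predicted value

    def _index(self, history):
        # Fold the local value history into a VPT index (illustrative hash).
        h = 0
        for v in history:
            h = (h * 31 + v) % self.vpt_entries
        return h

    def predict(self, pc):
        history = self.vht.get(pc)
        if history is None or len(history) < self.order:
            return None  # not enough local history yet
        return self.vpt.get(self._index(history))

    def train(self, pc, actual):
        history = self.vht.setdefault(pc, [])
        if len(history) == self.order:
            self.vpt[self._index(history)] = actual  # learn: history -> next value
        history.append(actual)
        if len(history) > self.order:
            history.pop(0)
```

Note the critical loop this structure implies: predicting the next occurrence requires the up-to-date local history, which is exactly why back-to-back occurrences are hard for FCM-style predictors, as discussed later.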

Goeman et al. implemented diff-FCM, which tracks the deltas in the local history rather than the values themselves, making the predictor more space-efficient.

Zhou et al. implemented the gDiff predictor, which computes the differences between an instruction's result and the results of the last n dynamic instructions. If a stable difference (a stride) is found, the instruction's result can be predicted from the results of earlier instructions. A drawback of gDiff is that it relies on another predictor to provide speculative global values at prediction time; for that very reason, however, gDiff can be layered on top of any predictor.

## Motivation

We identify two factors that will complicate the adaptation and implementation of value predictors in future processor cores.

First, the misprediction recovery penalty and/or hardware complexity. Second, back-to-back predictions for two occurrences of the same instruction, which can be very complex to implement yet are required to predict tight loops.

1. Misprediction Recovery
2. Back-to-back prediction

A tight loop is a loop that iterates many times with a loop body of only a few instructions.

### Misprediction Recovery

Moreover, despite quite high coverage and reasonable accuracy, one observation that can be made from these early studies is that the average performance gain per correct prediction is rather small.

1. average misprediction penalty $P_{value}$
2. absolute number of mispredictions $N_{misp}$

$T_{recov} = P_{value} \times N_{misp}$

As we already pointed out, the total misprediction recovery cost can be minimized through two vehicles: Minimizing the individual misprediction penalty and/or minimizing the total number of mispredictions.

• Minimize the penalty of an individual misprediction
• Minimize the total number of mispredictions
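A purely hypothetical illustration of the formula: all numbers below are invented, only the ratios matter.

```python
# T_recov = P_value * N_misp: the total recovery cost is the product of
# the average penalty and the number of mispredictions, so improving
# accuracy can pay off even with an expensive recovery mechanism.
def recovery_cost(p_value, n_misp):
    # p_value: average misprediction penalty in cycles
    # n_misp: absolute number of value mispredictions
    return p_value * n_misp

# Cheap recovery mechanism, mediocre accuracy:
reissue_cost = recovery_cost(6, 1000)
# Expensive recovery (squash at commit), but 10x fewer mispredictions:
squash_cost = recovery_cost(45, 100)
```

Here the high-accuracy predictor wins (4500 vs. 6000 cycles) despite a penalty per event that is over seven times larger, which is the core argument for trading coverage for accuracy.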

#### Value Misprediction Scenarios

1. pipeline squashing
2. selective reissue

They induce very different average misprediction penalties, but are also very different from a hardware complexity standpoint.

💚💚 What is pipeline squashing, and what does it do?

Pipeline squashing is already implemented to recover from branch mispredictions. On a branch misprediction, all the subsequent instructions in the pipeline are flushed and instruction fetch is resumed at the branch target. This mechanism is also generally used on load/store dependency mispredictions. Using pipeline squashing on a value misprediction is straightforward, but costly as the minimum misprediction penalty is the same as the minimum branch misprediction penalty. However, to limit the number of squashes due to VP, squashing can be avoided if the predicted result has not been used yet, that is, if no dependent instruction has been issued.

Pipeline squashing is used to recover from branch mispredictions and can equally be used to recover from value mispredictions, at the same cost. Note that a squash on a value misprediction can be avoided as long as the predicted result has not been used yet, i.e., no dependent instruction has been issued.

❌❌ Selective reissue is not so easy to understand; it needs another pass.

Selective reissue is implemented in processors to recover in case where instructions have been executed with incorrect operands, in particular this is used to recover from L1 cache hit/miss mispredictions (i.e. load-dependent instructions are issued after predicting a L1 hit, but finally the load results in a L1 miss). When the execution of an instruction with an incorrect operand is detected, the instruction as well as all its dependent chain of instructions are canceled then replayed.

#### Validation at Execution vs Validation at Commit Time

1. In terms of where it is implemented, selective reissue must happen at execution time, in order to limit the misprediction cost; pipeline squashing can be implemented either at execution time or at commit time.
2. Pipeline squashing with validation at execution time requires redesigning the out-of-order engine; in addition, the predicted values must be propagated through every out-of-order stage, and so on. Overall, validating at execution time is complex.
3. Conversely, pipeline squashing at commit time can incur a high misprediction penalty, but its mechanism is much simpler; in particular, it adds no extra complexity to the out-of-order engine.

| Recovery mechanism | Validation at Execution | Validation at Commit |
| --- | --- | --- |
| Pipeline squashing | Results in a minimum misprediction penalty; requires redesigning the complete out-of-order engine; 20~40 cycle penalty | Results in a quite high average misprediction penalty; does not induce complex mechanisms in the out-of-order execution engine; 40~50 cycle penalty |
| Selective reissue | Must be implemented at execution time; 5~7 cycle penalty | N/A |

However, validating predictions at execution time necessitates redesigning the complete out-of-order engine: The predicted values must be propagated through all the out-of-order execution engine stages and the predicted results must be validated as soon as they are produced in this out-of-order execution engine.

On the contrary, pipeline squashing at commit results in a quite high average misprediction penalty since it can delay prediction validation by a substantial number of cycles. Yet, it is much easier to implement for Value Prediction since it does not induce complex mechanisms in the out-of-order execution engine.

It essentially restrains the Value Prediction related hardware to the in-order pipeline front-end (prediction) and the in-order pipeline back-end (validation and training). Moreover, it allows not to checkpoint the rename table since the committed rename map contains all the necessary mappings to restart execution in a correct fashion.

Pipeline squashing at commit leads to a higher misprediction penalty, but its advantage is simplicity: the complex out-of-order engine does not need to be redesigned.

#### Balancing Accuracy and Coverage

The total misprediction penalty Trecov is roughly proportional to the number of mispredictions. Thus, if one drastically improves the accuracy at the cost of some coverage then, as long as the coverage of the predictor remains quite high, there might be a performance benefit brought by Value Prediction, even though the average value misprediction penalty is very high.

$T_{recov}$ is roughly proportional to the number of mispredictions, so if accuracy can be drastically improved at the cost of some coverage, the overall performance of VP can still improve.

#### Reissue

Reissue — all instructions after the first-use are kept in the IQ until they are no longer speculative, and may re-issue from there with minimal delay in case of a misprediction.

Selective Reissue — only instructions dependent on the predicted value (either directly or indirectly) are kept in the IQ until the prediction is resolved.

#### Refetch

Refetch — a value mispredict is treated like a branch mispredict. Instructions beginning with the first-use of the predicted value are squashed, and the fetch unit is responsible for getting them back in the machine.

A value misprediction can be treated like a branch misprediction: all instructions beginning with the first use of the predicted value are squashed and then re-fetched. Note the qualifier here: squashing starts at the first instruction that uses the predicted value.

### Back-to-back prediction

Unlike a branch prediction, a value prediction is needed rather late in the pipeline (at dispatch time).

📍📍📍 dispatch

The instruction dispatch unit controls when the decoded instructions are dispatched to the execution pipelines. It includes Issue Queues for storing instructions pending dispatch to the execution pipelines [2].

The Issue Queues hold the instructions waiting to be dispatched to the execution pipelines.

Thus, at first glance, prediction latency does not seem to be a concern and long lookups in large tables and/or fairly complex computations could be tolerated.

However, for most predictors, the outcomes of a few previous occurrences of the instruction are needed to perform a prediction for the current instance.

💚💚 One detail here: reading the original English carefully, the subject throughout is the instruction, i.e., the relationship between an instruction's previous occurrences and the prediction for its current occurrence.

Consequently, for those predictors, either the critical operation must be made short enough to allow for the prediction of close (possibly back-to-back) occurrences (e.g. by using small tables) or the prediction of tight loops must be given up.

Unfortunately, tight loops with candidates for VP are quite abundant in existing programs.

Experiments conducted with the methodology we will introduce in Section 7 suggest that for a subset of the SPEC’00/’06 benchmark suites, there can be as much as 15.3% (3.4% a-mean) fetched instructions eligible for VP and for which the previous occurrence was fetched in the previous cycle (8-wide Fetch). We highlight such critical operations for each predictor in the subsequent paragraphs.

#### LVP

Despite its name, LVP does not require the previous prediction to predict the current instance as long as the table is trained. Consequently, LVP uses only the program counter to generate a prediction.

LVP does not depend on previous prediction outcomes; it depends only on the program counter (PC).

Thus, successive table lookups are independent and can last until Dispatch, meaning that large tables can be implemented.
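A minimal LVP model illustrating the PC-only indexing; the sizes and the confidence policy echo the configuration described later (8K entries, 3-bit counters used only when saturated, reset on misprediction), but this is a sketch, not the hardware:

```python
# Sketch of Last Value Prediction (LVP): a PC-indexed table holding the
# last outcome and a confidence counter. Lookups depend only on the PC,
# so successive lookups are independent of one another.
class LVP:
    def __init__(self, entries=8192, conf_bits=3):
        self.entries = entries
        self.conf_max = (1 << conf_bits) - 1
        self.table = {}  # index -> [last_value, confidence]

    def _index(self, pc):
        return pc % self.entries  # indexed by PC only

    def predict(self, pc):
        entry = self.table.get(self._index(pc))
        # Use the prediction only when confidence is saturated.
        if entry and entry[1] == self.conf_max:
            return entry[0]
        return None

    def train(self, pc, actual):
        idx = self._index(pc)
        entry = self.table.setdefault(idx, [actual, 0])
        if entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.conf_max)
        else:
            entry[0], entry[1] = actual, 0  # reset confidence on misprediction
```

Because `predict` needs nothing from the previous dynamic instance, the lookup can be pipelined over several cycles, which is what makes large LVP tables practical.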

@todo

#### FCM

The first-level consists of a value history table accessed using the instruction address. This history is then hashed and used to index the second level table.

#### Summary

Table lookup time is not an issue as long as the prediction arrives before Dispatch for LVP and Stride. Therefore, large predictor tables can be considered for implementation. For stride-based value predictor, the main difficulty is that one has to track the last (possibly speculative) occurrence of each instruction.

For local value based predictors the same difficulty arises with the addition of tracking the n last occurrences. Moreover the critical operations (hash and the 2nd level table read) lead to either using small tables or not being able to timely predict back-to-back occurrences of the same instruction. Implementations of such predictors can only be justified if they bring significant performance benefit over alternative predictors.

The VTAGE predictor we introduce in this paper is able to seamlessly predict back-to-back occurrences of the same instruction, thus its access can span over several cycles. VTAGE does not require any complex tracking of the last occurrences of the instruction.

Section 8 shows that VTAGE (resp. hybrid predictor using VTAGE) outperforms a local value based FCM predictor (resp. hybrid predictor using a local value based FCM predictor).

### Commit Time Validation and Hardware Implications on the Out-of-Order Engine

In the previous section, we have pointed out that the hardware modifications induced by pipeline squashing at commit time on the Out-of-Order engine are limited.

In practice, the only major modification compared with a processor without Value Prediction is that the predicted values must be written in the physical registers before Dispatch.

At first glance, if every destination register has to be predicted for each fetch group, one would conclude that the number of write ports should double.

In that case the overhead on the register file would be quite high. The area cost of a register file is approximately proportional to $(R + W) * (R + 2W)$, $R$ and $W$ respectively being the number of read ports and the number of write ports.

Assuming $R = 2W$, the area cost without VP would be proportional to $12W^2$ and the one with VP would be proportional to $24W^2$, i.e. the double. Energy consumed in the register file would also be increased by around 50% (using very simple Cacti 5.3 approximation).

For practical implementations, there exist several opportunities to limit this overhead.

For instance one can limit the number of extra ports needed to write predictions. Each cycle, only a few predictions are used and the predictions can be known several cycles before Dispatch: One could limit the number of writes on each cycle to a certain limit, and buffer the extra writes, if there are any.

Assuming only $W/2$ write ports for writing predicted values leads to a register file area of $35W^2 /2$ , saving half of the overhead of the naive solution. The same saving on energy is observed (Cacti 5.3 estimations).
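A quick check of this arithmetic, assuming the paper's $R = 2W$ (the concrete port count below is an arbitrary unit; only the ratios matter):

```python
# Worked check of the register-file area model used above:
# area ~ (R + W) * (R + 2W), with R read ports and W write ports.
def area(read_ports, write_ports):
    return (read_ports + write_ports) * (read_ports + 2 * write_ports)

W = 4              # write ports, arbitrary unit
R = 2 * W          # the paper's assumption R = 2W

base = area(R, W)                  # 12 W^2: no value prediction
naive_vp = area(R, 2 * W)          # 24 W^2: doubling the write ports
buffered_vp = area(R, W + W // 2)  # 17.5 W^2: only W/2 extra write ports
```

The buffered scheme lands at $35W^2/2$, saving half of the $12W^2$ overhead that the naive port-doubling solution adds.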

Another opportunity is to allocate physical registers for consecutive instructions in different register file banks, limiting the number of write ports on the individual banks. One can also prioritize the predictions according to the criticality of the instruction and only use the most critical ones, leveraging the work on criticality estimation of Fields et al.

1. For example, one can limit the number of extra write ports needed for predictions; the surrounding text gives example calculations.
2. Allocate physical registers for consecutive instructions in different banks to limit the write ports per bank; one can even prioritize predictions by instruction criticality and use only the most critical ones.

Exploring the complete optimization to reduce the overhead on the register file design is out of the scope of this paper. It would depend on the precise micro-architecture of the processor, but we have clearly shown that this overhead in terms of energy and silicon area can be reduced to less than 25% and 50% respectively. Moreover, this overhead is restricted to the register file and does not impact the other components of the out-of-order engine. Similarly, thanks to commit time validation, the power overhead introduced by Value Prediction will essentially reside in the predictor table.

❌❌❌ This sentence is hard to parse: "Similarly, thanks to commit time validation, the power overhead introduced by Value Prediction will essentially reside in the predictor table."

### Maximizing Value Predictor Accuracy Through Confidence

As we already pointed out, the total misprediction recovery cost can be minimized through two vehicles: Minimizing the individual misprediction penalty and/or minimizing the total number of mispredictions.

Since using the prediction is not mandatory (contrary to branch predictions), an efficient way to minimize the number of mispredictions is to use a saturating counter to estimate confidence and to use the prediction only when the associated confidence is very high.

For instance, for the value predictors considered in this study, a 3-bit confidence counter per entry that is reset on each misprediction leads to an accuracy in the 95-99% range if the prediction is used only when the counter is saturated. However this level of accuracy is still not sufficient to avoid performance loss in several cases unless idealistic selective reissue is used.

To increase accuracy, Burtscher et al. proposed the SAg confidence estimation scheme to assign confidence to a history of outcomes rather than to a particular instruction. However, this entails a second lookup in the counter table, using the outcome history retrieved from the predictor table with the PC of the instruction. A way to maximize accuracy without increasing complexity and latency would be preferable.

We actually found that simply using wider counters (e.g. 6 or 7 bits) leads to much more accurate predictors while the prediction coverage is only reduced by a fraction.

Prediction is only used on saturated confidence counters and counters are reset on each misprediction. Interestingly, probabilistic 3-bit counters such as defined by Riley et al. augmented with reset on misprediction achieve the same accuracy for substantially less storage and a marginal increase in complexity.

We refer to these probabilistic counters as Forward Probabilistic Counters (FPC). In particular, each forward transition is only triggered with a certain probability.

In this paper, we will consider 3-bit confidence counters using a probability vector $v = \{1, 1/16, 1/16, 1/16, 1/16, 1/32, 1/32\}$ for pipeline squashing at commit and $v = \{1, 1/8, 1/8, 1/8, 1/8, 1/16, 1/16\}$ for selective reissue, respectively mimicking 7-bit and 6-bit counters.

This generally prevents all the considered VP schemes from slowing down execution while minimizing the loss of coverage (as opposed to using lower probabilities). The pseudo-random generator used is a simple Linear Feedback Shift Register.
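A minimal sketch of such a counter, assuming the commit-time probability vector quoted above; a real design would draw its random bits from the LFSR rather than `random()`:

```python
import random

# Sketch of a Forward Probabilistic Counter (FPC): a 3-bit counter whose
# forward transitions succeed only with a per-level probability, so it
# behaves like a much wider counter while storing only 3 bits.
class FPC:
    # Probability vector for squashing at commit (mimics a 7-bit counter).
    PROBS = [1, 1 / 16, 1 / 16, 1 / 16, 1 / 16, 1 / 32, 1 / 32]
    MAX = 7  # saturation value of the 3-bit counter

    def __init__(self, rng=random.random):
        self.value = 0
        self.rng = rng  # injectable randomness source (LFSR in hardware)

    def correct(self):
        # Advance only with the probability attached to the current level.
        if self.value < self.MAX and self.rng() < self.PROBS[self.value]:
            self.value += 1

    def mispredict(self):
        self.value = 0  # reset on any misprediction

    def confident(self):
        return self.value == self.MAX
```

With this vector, saturating from zero takes about 1 + 4×16 + 2×32 = 129 correct predictions on average, which is why a 3-bit FPC mimics a 7-bit counter.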

Using FPC counters instead of full counters limits the overhead of confidence estimation. It also opens the opportunity to adapt the probabilities at run-time as suggested in and/or to individualize these probabilities depending on the criticality of the instructions.

### The Value TAgged GEometric Predictor

As it uses branch history to predict, we expect VTAGE to perform much better than other predictors when instruction results are indeed depending on the control flow.

Nonetheless, VTAGE is also able to capture control-flow independent patterns as long as they are short enough with regard to the maximum history length used by the predictor.

In particular, it can still capture short strided patterns, although space efficiency is not optimal since each value of the pattern will reside in an entry (contrarily to the Stride predictor where one pattern can be represented by a single entry).

VTAGE predicts using the branch history, so we expect it to perform much better than other predictors when instruction results actually depend on the control flow. Nonetheless, VTAGE can also capture control-flow-independent patterns, as long as they are short relative to the maximum history length used by the predictor.

❌❌ Control flow

Fig. 2 describes a (1+N)-component VTAGE predictor. The main idea of the VTAGE scheme (exactly like the ITTAGE scheme) is to use several tables – components – storing predictions. Each table is indexed by a different number of bits of the global branch history, hashed with the PC of the instruction.

The different lengths form a geometric series (i.e. VT1 is accessed with two bits of the history, VT2 with four, VT3 with eight and so on).

These tables are backed up by a base predictor – a tagless LVP predictor – which is accessed using the instruction address only.

In VTAGE, an entry of a tagged component consists of a partial tag, a 1-bit usefulness counter u used by the replacement policy, a full 64-bit value val, and a confidence/hysteresis counter c. An entry of the base predictor simply consists of the prediction and the confidence counter.

❌❌❌ bits of global branch history

🔴🔴 Why an XOR is used here deserves deeper thought.

VTAGE mainly uses a set of tables VT1, VT2, ..., VTn: VT1 is indexed with 2 bits of the global branch history, VT2 with 4 bits, VT3 with 8 bits, and so on; this is the geometric series.
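A sketch of how the geometric history lengths and the PC-XOR-folded-history indexing might look; the index width, the folding scheme, and the hash itself are assumptions for illustration, not the actual VTAGE hardware:

```python
# Illustrative VTAGE-style indexing: each tagged component is indexed by
# the PC hashed (XORed) with a geometrically growing slice of the global
# branch history, folded down to the table's index width.
N_TABLES = 6
MIN_HIST = 2
# Geometric series of history lengths: 2, 4, 8, 16, 32, 64.
HIST_LENGTHS = [MIN_HIST * 2 ** i for i in range(N_TABLES)]

def fold(history_bits, width):
    # Fold a long history into `width` bits by XORing width-bit chunks.
    folded = 0
    while history_bits:
        folded ^= history_bits & ((1 << width) - 1)
        history_bits >>= width
    return folded

def table_index(pc, global_history, table, index_bits=10):
    # Use the `hist_len` most recent branch outcomes for this component.
    hist_len = HIST_LENGTHS[table]
    hist = global_history & ((1 << hist_len) - 1)
    return (pc ^ fold(hist, index_bits)) & ((1 << index_bits) - 1)
```

Because the index depends only on the PC and the (already available) global branch history, not on previous values of the instruction, the lookup has no per-instruction critical loop.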

At prediction time, all components are searched in parallel to check for a tag match. The matching component accessed with the longest history is called the provider component as it will provide the prediction to the pipeline.

At update time, only the provider is updated.

On either a correct or an incorrect prediction, $c$ and $u$ are updated.

On a misprediction, $val$ is replaced if $c$ is equal to 0, and a new entry is allocated in a component using a longer history than the provider: All “upper” components are accessed to see if one of them has an entry that is not useful ($u$ is 0). If none is found, the u counter of all matching entries in the upper components are reset, but no entry is allocated. Otherwise, a new entry is allocated in one of the components whose corresponding entry is not useful. The component is chosen randomly.
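The provider selection and the allocation-on-misprediction policy described above can be sketched as follows; the entry layout and table lookup are simplified assumptions:

```python
import random

# Simplified model of VTAGE provider selection and entry allocation.
class Entry:
    def __init__(self, tag):
        self.tag, self.val, self.c, self.u = tag, 0, 0, 0

def choose_provider(tables, tags):
    # Search all tagged components; the matching component accessed with
    # the longest history provides the prediction.
    for t in reversed(range(len(tables))):
        entry = tables[t].get(tags[t])
        if entry is not None:
            return t, entry
    return None, None  # fall back to the base (tagless LVP) component

def on_mispredict(tables, tags, provider, actual):
    entry = tables[provider].get(tags[provider]) if provider is not None else None
    if entry is not None and entry.c == 0:
        entry.val = actual  # replace the value only when confidence is zero
    # Try to allocate in a component using a longer history than the provider.
    start = 0 if provider is None else provider + 1
    upper = list(range(start, len(tables)))
    free = [t for t in upper
            if tags[t] not in tables[t] or tables[t][tags[t]].u == 0]
    if free:
        t = random.choice(free)  # the component is chosen randomly
        new = Entry(tags[t])
        new.val = actual
        tables[t][tags[t]] = new
    else:
        for t in upper:  # no room: reset the u bits instead of allocating
            if tags[t] in tables[t]:
                tables[t][tags[t]].u = 0
```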

The main difference between VTAGE and ITTAGE is essentially the usage: The predicted value is used only if its confidence counter is saturated. We refer the reader to for a detailed description of ITTAGE.

The difference between VTAGE and ITTAGE is in how the output is used: the predicted value is used only when its confidence counter is saturated.

Lastly, as a prediction does not depend on previous values but only on previous control-flow, VTAGE can seamlessly predict instructions in tight loops and behaves like LVP in Fig. 1. However, due to index hash and multiplexing from multiple components, it is possible that its prediction latency will be higher, although this is unlikely to be an issue since prediction can span several cycles.

## Evaluation Methodology

### Value Predictors

#### Single Scheme Predictors

We study the behavior of several distinct value predictors in addition to VTAGE.

Namely, LVP, the 2-delta Stride predictor (2D-Stride) as a representative of the stride-based predictor family, and a generic order-4 FCM predictor (o4-FCM).
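The 2-delta idea can be sketched as follows: the stride used for prediction is committed only after the same delta is observed twice, which filters out one-off irregular values (an illustrative model, not the simulated configuration):

```python
# Sketch of the 2-delta Stride predictor: each entry keeps the last
# value, a speculative stride (stride1), and the stride actually used
# for prediction (stride2); stride2 is updated only when the same delta
# is seen twice in a row.
class TwoDeltaStride:
    def __init__(self):
        self.table = {}  # pc -> [last_value, stride1, stride2]

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None
        return entry[0] + entry[2]  # last value + committed stride

    def train(self, pc, actual):
        entry = self.table.setdefault(pc, [actual, 0, 0])
        delta = actual - entry[0]
        if delta == entry[1]:
            entry[2] = delta  # same delta twice: commit it as the stride
        entry[1] = delta      # remember the latest delta
        entry[0] = actual
```

After seeing 10, 20, 30 the committed stride is 10 and the predictor returns 40; a single irregular outcome perturbs only the speculative stride, not the committed one.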

All predictors use 3-bit saturating counters as confidence counters. The prediction is used only if the confidence counter is saturated.

Baseline counters are incremented by one on a correct prediction and reset on a misprediction. The predictors were simulated with and without FPC (See Section 5). As the potential of VP has been covered extensively in previous work, we limit ourselves to reasonably sized predictors to gain more concrete insights.

We start from a 128KB LVP (8K-entry) and derive the other predictors, each of them having 8K entries as we wish to gauge the prediction generation method, not space efficiency. Predictor parameters are illustrated in Table 1.

❌❌ We start from a 128KB LVP (8K entries) and derive the other predictors, each with 8K entries, because we want to gauge the prediction generation method rather than space efficiency.

For VTAGE, we consider a predictor featuring 6 tables in addition to a base component. The base component is a tagless LVP predictor. We use a single useful bit per entry in the tagged components and a 3-bit hysteresis/confidence counter c per entry in every component. The tag of tagged components is 12+rank-bit long with rank varying between 1 and 6. The minimum and maximum history lengths are respectively 2 and 64 as we found that these values provided a good tradeoff in our experiments.

For o4-FCM, we use a hash function similar to those…

We consider that all predictors are able to predict instantaneously. As a consequence, they can seamlessly deliver their prediction before Dispatch.

This also implies that o4-FCM is – unrealistically – able to deliver predictions for two occurrences of the same instruction fetched in two consecutive cycles. Hence, its performance is most likely overestimated.

### Simulator

In our experiments, we use the gem5 cycle-accurate simulator (x86 ISA).

We model a fairly aggressive pipeline: 4GHz, 8-wide superscalar, out-of-order processor with a latency of 19 cycles.

We chose a slow front-end (15 cycles) coupled to a swift back-end (3 cycles) to obtain a realistic misprediction penalty.


#### Misprediction Recovery

We illustrate two possible recovery scenarios, squashing at commit time and a very idealistic selective reissue.

In both scenarios, recovery is unnecessary if the prediction of instruction I was wrong but no dependent instruction has been issued before the execution of I, since the prediction is replaced by the effective result at execution time. This removes useless squashes and is part of our implementation.

There are two ways to recover from a misprediction:

1. Squashing at commit time
2. A very idealistic selective reissue ("idealistic" is the authors' characterization, not my own view)

## Reference

1. A. Perais and A. Seznec, "Practical data value speculation for future high-end processors," in *IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)*, Feb. 2014.