In the previous post I explained the threading model of the new AEMB. In this post I will go through all the changes to the old AEMB core that were necessary to accommodate a coarse-grained model, and wrap up with an analysis of how the new threading model affects performance.

To begin with, the major change was in the instruction address circuit. The registers holding the possible addresses for the next instruction were greatly altered; for a list of those registers, refer to the previous post. Furthermore, circuitry was added to detect thread switches and thread hazards, where the pipeline needs to be stalled until the target of the next thread has been determined. The biggest addition was in detecting data hazards. The design was expanded to detect hazards between back-to-back instructions and between instructions with two gaps between them. Last but not least, two forwarding paths were added, as shown in the following figure.


The GPHA signal, which specifies the thread currently being executed, was updated to change only on thread switches or delay slots. This signal now also propagates along the pipeline as part of the register file addresses, so that the core does not mistakenly flag a data hazard between two different threads.
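To make the point about thread-tagged register addresses concrete, here is a minimal behavioural sketch in Python. It is not the actual AEMB logic: the RegAddr type and the data_hazard function are hypothetical names of my own, and the only idea taken from the text above is that the thread select travels with the register number, so two addresses only collide when both fields match.

```python
from typing import NamedTuple, Optional


class RegAddr(NamedTuple):
    """Register file address as seen by the hazard logic (hypothetical layout)."""
    thread: int  # thread-select value, standing in for GPHA
    reg: int     # register number within that thread's bank


def data_hazard(src: RegAddr, dst: Optional[RegAddr]) -> bool:
    """True when a source operand matches the destination of an older,
    still in-flight instruction. dst is None if that instruction writes nothing."""
    if dst is None:
        return False
    # Tuple equality compares the thread field as well as the register number.
    return src == dst


# Same register number, same thread: a real dependency.
print(data_hazard(RegAddr(0, 5), RegAddr(0, 5)))
# Same register number, different thread: not flagged.
print(data_hazard(RegAddr(0, 5), RegAddr(1, 5)))
```

Comparing only the register number would flag the second case as a hazard, which is the mistake that carrying the thread select down the pipeline is meant to prevent.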

Moreover, the branch target and branch condition are now evaluated only when there is a branch instruction, so the branch target register holds its value until the next thread switch. In the previous core, the register holding the branch target was continuously overwritten by whatever data was present in the pipeline.
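The difference between the two behaviours can be sketched as a register with and without a write enable. This is a simplified Python model rather than the real circuit, and the class and parameter names (BranchTargetReg, hold_unless_branch, is_branch, pipeline_value) are assumptions made for illustration.

```python
class BranchTargetReg:
    """Toy model of the branch target register, one clock() call per cycle."""

    def __init__(self, hold_unless_branch: bool):
        # True models the new core, False models the old one.
        self.hold_unless_branch = hold_unless_branch
        self.value = 0

    def clock(self, is_branch: bool, pipeline_value: int) -> int:
        if self.hold_unless_branch:
            # New behaviour: latch only when a branch is being evaluated.
            if is_branch:
                self.value = pipeline_value
        else:
            # Old behaviour: overwritten every cycle by whatever is in the pipeline.
            self.value = pipeline_value
        return self.value


new_reg = BranchTargetReg(hold_unless_branch=True)
old_reg = BranchTargetReg(hold_unless_branch=False)
for is_branch, value in [(True, 0x100), (False, 0x2A), (False, 0x7)]:
    new_reg.clock(is_branch, value)
    old_reg.clock(is_branch, value)

print(hex(new_reg.value), hex(old_reg.value))
```

After the loop the new-style register still holds the target latched by the branch, while the old-style one holds the last unrelated pipeline value.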

Another instruction path that needed to be altered was that of the immediate instruction. The change had to account for its data being used in the next cycle, and to make sure that the data is still available after the pipeline recovers from a stall.


The new model improves how the processor handles control hazards, but it reduces its ability to resolve data dependencies.

For control hazards, the old AEMB needed to stall the pipeline whenever it encountered a branch without a delay slot, until the branch condition and its target were determined. In the new CPU, branch targets no longer need to be used except when a thread switch takes place, so the processor only stalls on branching if it encounters two branch instructions back to back and the first of them has no delay slot.
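The two stall rules described above can be compared on a toy instruction stream. The sketch below is only an illustration of the rules as stated in this post: the instruction kinds ("alu", "br" for a branch without a delay slot, "brd" for a branch with one) and the function names are made up, and the functions count stall events rather than stall cycles.

```python
BRANCHES = {"br", "brd"}  # "br": no delay slot, "brd": with delay slot (labels are assumptions)


def old_control_stalls(program):
    """Old core: every branch without a delay slot causes a stall."""
    return sum(1 for kind in program if kind == "br")


def new_control_stalls(program):
    """New core: stall only when a branch without a delay slot is
    immediately followed by another branch."""
    return sum(
        1
        for first, second in zip(program, program[1:])
        if first == "br" and second in BRANCHES
    )


program = ["alu", "br", "alu", "brd", "br", "br", "alu"]
print(old_control_stalls(program), new_control_stalls(program))
```

In this stream the old rule stalls on all three "br" instructions, while the new rule stalls only on the one "br" that is directly followed by another branch.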

For data hazards, the old AEMB forwarded data from arithmetic instructions only from the EX register and needed to insert only one stall for hazards that could not be resolved. The new CPU, however, requires data forwarding from three different locations. This does not cost any clock cycles, but it greatly increases both the area and the critical path of the processor. Furthermore, the processor might now need to stall the pipeline for up to three cycles in the worst case, which is two back-to-back instructions with a data dependency that cannot be resolved by forwarding.
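One rough way to weigh the two effects against each other is a simple cycle estimate: one cycle per instruction plus the stall cycles from each kind of hazard. The sketch below is a back-of-the-envelope model only. The function name and parameters are my own, the one-cycle control stall is an assumption, and the counts in the usage lines are invented inputs, not measurements from either core.

```python
def total_cycles(instructions, control_stalls, unresolved_data_hazards,
                 data_stall_penalty, control_stall_penalty=1):
    """Estimate cycles as one per instruction plus stall cycles.

    control_stall_penalty=1 is an assumption; data_stall_penalty is 1 for
    the old core and up to 3 (worst case) for the new one, per the text above.
    """
    return (instructions
            + control_stalls * control_stall_penalty
            + unresolved_data_hazards * data_stall_penalty)


# Invented counts, purely to show how the comparison would be set up.
old_estimate = total_cycles(1000, control_stalls=100,
                            unresolved_data_hazards=20, data_stall_penalty=1)
new_estimate = total_cycles(1000, control_stalls=10,
                            unresolved_data_hazards=20, data_stall_penalty=3)
print(old_estimate, new_estimate)
```

With real stall counts taken from simulation, the same arithmetic would show whether the saved control stalls outweigh the longer data stalls for a given program.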

Further quantitative analysis is necessary to determine which of the two effects dominates the overall performance of the processor, and to verify the effect of the changes on single-threaded applications.

To avoid lengthy posts, debugging the AEMB and an explanation of the current bug are left for the next post.