Coarse-Grained Threading

I’ve finally changed the threading model of the AEMB. Some instructions are still buggy, so the demo program doesn’t run properly yet, but I want to use this post and the next one to explain the changes I’ve made so far.

In this post I’ll explain the design of the new threading model and how it is expected to behave. I will provide pipeline diagrams and actual waveforms from the demo program running on it.

The coarse-grained model switches threads whenever a branch instruction (conditional, unconditional, or a return) is inserted into the pipeline. In the following, I’ll explain how the model behaves in the general cases: the initialization stage, delay slots, and back-to-back thread switches.

I will use pipeline figures to explain the expected behavior of the processor. Each column represents a stage of the pipeline. The top row contains the names of the registers holding the data at each stage. For example, an instruction A1 in the stage labeled “EX” means that A1 is currently at the stage where the EX registers hold the results of execution.

ICH_ADR is not a register; it is simply the current address being supplied to the cache. ICH_ADR gets its value from several registers before it and the choice depends on the state of the pipeline. Here is a list of the possible options and their content:

  • rADR: the increment of the current address.
  • branch_target: the target of a branch operation. It holds the target of the branch from the previous thread, whether the branch is taken or not.
  • Stall: loads the same address again for forward hazards, or loads the incremented address of the last instruction in the previous thread, which is the delay slot instruction.
  • branch_target_1: a delayed version of branch_target.
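
As a rough illustration of the list above, the selection can be pictured as a four-way multiplexer. This is only a Python sketch, not the core’s actual HDL: the `AdrSource` enum, the `select_ich_adr` function, and the single select value are names I made up for illustration, and the logic that picks the source from the pipeline state is not modeled.

```python
from enum import Enum


class AdrSource(Enum):
    """Hypothetical labels for the four sources listed above."""
    RADR = "rADR"
    BRANCH_TARGET = "branch_target"
    STALL = "stall"
    BRANCH_TARGET_1 = "branch_target_1"


def select_ich_adr(source, radr, branch_target, stall_adr, branch_target_1):
    """Return the address driven onto ICH_ADR for the chosen source.

    ICH_ADR is not a register, so this is a pure function of its inputs.
    """
    candidates = {
        AdrSource.RADR: radr,
        AdrSource.BRANCH_TARGET: branch_target,
        AdrSource.STALL: stall_adr,
        AdrSource.BRANCH_TARGET_1: branch_target_1,
    }
    return candidates[source]


# Example with made-up addresses: a thread switch selects the branch target.
print(hex(select_ich_adr(AdrSource.BRANCH_TARGET, 0x104, 0x200, 0x100, 0x1F0)))
```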

Instructions from one thread are labeled with the letter A, and instructions from the other thread with the letter B. Instructions are considered written back once they have left the MX stage.

First, in normal operation the core fetches instructions from the same thread until a branch instruction is encountered. This has serious implications for data dependencies: compared to the old core, extra dependencies need to be resolved, so more forwarding and stalling will be needed.

[Figures 1 and 2]

When a branch instruction is inserted in the pipeline, it is detected in the Instruction Fetch (IF) stage and the next instruction to be fetched is the branch target from the other thread.

[Figures 3 and 4]

However, if the branch instruction has a delay slot, the pipeline will fetch the first instruction of the other thread, then the delay slot, and then continue with the other thread.

[Figures 5 and 6]
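
To make the fetch order concrete, here is a small Python model of the two cases described so far. It is a sketch under my own assumptions, not a description of the actual hardware: each thread is given as a list in program-flow order (so the entry after a branch stands for wherever that thread resumes), the `fetch_order` function and the `"op"`/`"br"`/`"brd"` kinds are invented for illustration, and back-to-back switches are deliberately left out.

```python
def fetch_order(thread_a, thread_b, limit=16):
    """Return instruction names in the order they would be fetched.

    Each thread is a list of (name, kind) pairs in program-flow order, where
    kind is "op", "br" (branch) or "brd" (branch with a delay slot).
    A branch fetched as the first instruction after a switch is not modeled.
    """
    threads = (thread_a, thread_b)
    pc = [0, 0]   # next instruction index for each thread
    cur = 0       # thread currently being fetched
    order = []

    def fetch(t):
        """Fetch the next instruction of thread t and return its kind."""
        if pc[t] >= len(threads[t]):
            return None
        name, kind = threads[t][pc[t]]
        pc[t] += 1
        order.append(name)
        return kind

    while len(order) < limit:
        kind = fetch(cur)
        if kind is None:
            break
        if kind == "op":
            continue              # stay on the same thread
        other = 1 - cur
        if kind == "brd":
            fetch(other)          # first instruction of the other thread
            fetch(cur)            # then the delay slot of the branch
        cur = other               # then continue with the other thread
    return order[:limit]


a = [("A1", "brd"), ("A2", "op"), ("A3", "op")]
b = [("B1", "op"), ("B2", "op"), ("B3", "br")]
print(fetch_order(a, b))
```

With these inputs the delay slot A2 is fetched right after B1, and thread A only resumes (at A3) once thread B reaches its own branch.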

If a thread switch takes place but the first instruction in the other thread (B1) is also a branch, then another switch is required that returns to the first thread. However, the pipeline needs to be stalled until the branch instruction from the first thread (A1) is resolved in the execution stage. Only one stall is necessary, as shown in the figure below.

[Figures 7 and 8]

However, if A1 has a delay slot the pipeline need not be stalled as it’ll be filled by the delay slot (A2).

[Figure 9]
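
The two back-to-back cases above differ only in what occupies the slot between B1 and A’s branch target. A tiny Python sketch of that, assuming B1 has no delay slot; the function name and the slot labels are mine, purely for illustration:

```python
def back_to_back_fetch(a1_has_delay_slot):
    """Fetch slots when A1 and B1 are both branches (B1 without a delay slot).

    While A1 is still being resolved in the execution stage, the slot after
    B1 holds either A1's delay slot (A2) or a single stall.
    """
    filler = "A2 (delay slot)" if a1_has_delay_slot else "stall"
    return ["A1", "B1", filler, "A branch target"]


print(back_to_back_fetch(False))
print(back_to_back_fetch(True))
```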

The behavior is similar if the second branch is the one that has the delay slot, or if both of them have delay slots, except that the branch target fetch will be late by one cycle. The target therefore needs to be fetched from the delayed version of branch_target (branch_target_1), because branch_target itself will by then be holding the target for the B thread.

[Figures 10 and 11]

In the initial stages of the program, both threads begin at address zero. The first thread starts execution, and once it hits a branch the core switches to the other thread, which starts at address zero as well. This way the two threads execute the same initial instructions necessary to initialize the core until the thread split takes place. The splitting mechanism is exactly the same as in the original core: it causes a branch instruction to be taken for one thread but not the other, hence setting the threads on different paths.
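
The start-up behavior can be pictured with a toy Python trace. This is a sketch only: the instruction encoding, the `"split"` opcode, and the use of a `thread_id` to decide which thread takes the branch are assumptions for illustration, and the interleaving of the two threads in the pipeline is ignored.

```python
def pc_trace(thread_id, program, max_steps=32):
    """Return the addresses visited by one thread, starting at address zero.

    program is a list of (op, arg) pairs. The hypothetical "split" op is a
    branch that is taken by thread 1 and not taken by thread 0.
    """
    pc = 0  # both threads begin at address zero
    trace = []
    for _ in range(max_steps):
        if pc >= len(program):
            break
        trace.append(pc)
        op, arg = program[pc]
        if op == "halt":
            break
        if op == "split" and thread_id == 1:
            pc = arg      # branch taken for this thread only
        else:
            pc += 1       # fall through to the next address
    return trace


PROGRAM = [
    ("init", None),   # 0: shared initialization
    ("init", None),   # 1: shared initialization
    ("split", 5),     # 2: taken by thread 1 only
    ("work", None),   # 3: thread 0 path
    ("halt", None),   # 4
    ("work", None),   # 5: thread 1 path
    ("halt", None),   # 6
]

print(pc_trace(0, PROGRAM))
print(pc_trace(1, PROGRAM))
```

Both traces share the same initial addresses and only diverge at the split branch.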

In the next posts I’ll explain the changes I’ve made to the processor, my debugging method, and finally the bug that I’m currently trying to resolve.
