Hardware-Assisted Synchronisation

An idea about hardware-assisted synchronisation for the AEMB came about recently. The idea centred on the problem of how multiple threads would communicate with each other. The solution to this problem has always been synchronisation primitives.

The AEMB already supports the atomic MSRSET/MSRCLR instructions, which can be used as a mutex primitive. This hardware mutex can be used to build additional synchronisation operations in software. However, layering every other synchronisation operation on top of a single mutex in software is inefficient.
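To make the mutex idea concrete, here is a rough host-side sketch rather than real AEMB code. It assumes MSRSET and MSRCLR atomically set or clear bits and hand back the previous register value, and it models them with C11 atomics. The names `model_msr`, `model_msrset`, `model_msrclr` and the choice of `LOCK_BIT` are all made up for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock bit; which status bit is free for this is an assumption. */
#define LOCK_BIT 0x1u

/* Host-side stand-in for the machine status register. */
static atomic_uint model_msr;

/* Models MSRSET: atomically OR the bits in and return the previous value. */
static uint32_t model_msrset(uint32_t bits) {
    return (uint32_t)atomic_fetch_or(&model_msr, (unsigned)bits);
}

/* Models MSRCLR: atomically clear the bits and return the previous value. */
static uint32_t model_msrclr(uint32_t bits) {
    return (uint32_t)atomic_fetch_and(&model_msr, ~(unsigned)bits);
}

/* The lock is taken only if the bit was clear before we set it. */
static bool mutex_try_lock(void) {
    return (model_msrset(LOCK_BIT) & LOCK_BIT) == 0;
}

/* Releasing the lock simply clears the bit again. */
static void mutex_unlock(void) {
    (void)model_msrclr(LOCK_BIT);
}
```

A thread that fails `mutex_try_lock()` has to retry or yield, and that spinning is part of what makes anything richer than a plain mutex costly to build this way.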

Since the AEMB is a multi-threaded processor by default, it will be used in situations where synchronisation will need to be carried out regularly. Therefore, it may be prudent to include certain hardware devices that can help with synchronisation.

As an example, a common synchronisation problem is the producer-consumer problem. This can be helped in hardware with the use of a specialised FIFO that forces synchronisation in hardware and ensures atomic read/write operations to it. There are other classic synchronisation problems such as rendezvous that can also benefit from minor hardware additions.
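As a rough picture of what such a FIFO would do, here is a small software model. It is only a sketch: the depth, the `sync_fifo` type and the function names are assumptions, and a `false` return stands in for the point where the hardware would stall the calling thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4u /* hypothetical depth */

typedef struct {
    uint32_t slot[FIFO_DEPTH];
    size_t head;  /* index of the next word to read */
    size_t count; /* number of occupied slots */
} sync_fifo;

/* Producer side: returns false when full, where hardware would stall. */
static bool fifo_put(sync_fifo *f, uint32_t value) {
    if (f->count == FIFO_DEPTH) {
        return false;
    }
    f->slot[(f->head + f->count) % FIFO_DEPTH] = value;
    f->count++;
    return true;
}

/* Consumer side: returns false when empty, where hardware would stall. */
static bool fifo_get(sync_fifo *f, uint32_t *out) {
    if (f->count == 0) {
        return false;
    }
    *out = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

In this model nothing protects `head` and `count` from concurrent access. The point of doing it in hardware is that each read or write becomes a single atomic operation, so the software needs no extra locking.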

These are some of the things that will make it into a future AEMB.

Virtual Kernel

While everyone seems to be quite focused on high-end virtualisation, there is also room for virtualisation solutions at the low end. One way of approaching this problem is to use a thin layer of virtualisation at the nano-kernel level. Instead of just abstracting hardware away, it is also possible to put in entirely virtual hardware devices for embedded applications. This allows things like I/O peripherals to be abstracted and run purely in software.

The AENIX kernel will sport such virtualisation solutions instead of just being a regular nano-kernel. It will support the use of soft-peripherals or virtual-peripherals to lower the cost of implementing complex Systems-on-Chip (SoCs). As an example, a typical Ethernet core is much larger than a regular RISC processor core. However, it would be possible to virtualise the entire MAC layer in software, which simplifies the hardware tremendously, resulting in a cost saving of half the original. In addition, it is possible to build a virtual video device to transmit video frames over Ethernet, which will further virtualise hardware and change the way that computing is done on the SoC.

That is the main thrust of the AENIX kernel. Instead of merely abstracting the sharing of hardware resources away, it will also feature abstract hardware that does not really exist.
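One way to picture a virtual peripheral is as a register interface whose reads and writes land in software rather than in real logic. The sketch below is purely illustrative and is not the AENIX API. The `vdev` type and the `soft_*` functions are made-up names, and the "device" is nothing more than four scratch registers.

```c
#include <stdint.h>

/* Hypothetical device interface: drivers only ever see read and write. */
typedef struct vdev {
    uint32_t (*read)(struct vdev *self, uint32_t reg);
    void (*write)(struct vdev *self, uint32_t reg, uint32_t value);
    uint32_t state[4]; /* backing store for a software-only device */
} vdev;

/* Software-only device: its registers are plain memory. */
static uint32_t soft_read(vdev *self, uint32_t reg) {
    return self->state[reg & 3u];
}

static void soft_write(vdev *self, uint32_t reg, uint32_t value) {
    self->state[reg & 3u] = value;
}

/* Builds a device that exists only in software. */
static vdev soft_device(void) {
    vdev d = { soft_read, soft_write, {0, 0, 0, 0} };
    return d;
}
```

A driver written against `vdev` would not need to know whether the other side is real logic or a kernel routine, which is the property a software MAC layer would rely on.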

This opens an exciting dimension in the realm of communications.

Power Optimisation

Since the recent LLVM 2.7 release came with initial support for the MicroBlaze, it is now conceivable to add some features to LLVM to enable power optimisation for the AEMB and other architectures. LLVM is chosen over GCC for purely subjective reasons: a cleaner code base and the open license adopted.

The idea of power optimisation rests on the premise that power is a system-level problem, not just a hardware one. While hardware is a large contributor, hardware cannot do anything unless it is commanded by software, which is ultimately slaved to user events. Therefore, it is unfair to place the entire power problem in the hands of the hardware designer.

One step in the right direction is to incorporate power metrics into compiler optimisation. Compilers already offer well-known speed and size optimisation options, such as -O2 and -Os in GCC and Clang. These flags tell the compiler to run various extra algorithms to come up with optimised code-paths that are either faster or consume less memory than the default output.

There is no reason why this cannot be extended to power optimised code.

As a simple example, let us consider a multiply-by-two operation. This can be accomplished by using the hardware multiplier, the adder or even the shifter. Each functional element produces the same result from a purely mathematical standpoint.

However, each hardware block has a very different power envelope. Some are simple routing devices while others are complex devices with lots of gates and transistors stuck in between. There are obviously trade-offs too, as not all these operations can complete within a single clock cycle on all architectures.
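The choice a power-aware compiler would face can be sketched as a small cost table. This is only an illustration: the energy and cycle figures below are invented rather than measured, and `op_choice`, `twice_ops` and `pick_lowest_energy` are hypothetical names. Note too that an ordinary optimising compiler may well turn all three C functions into the same instruction, so they stand for the three hardware routes rather than guaranteeing them.

```c
#include <stddef.h>
#include <stdint.h>

/* Three mathematically equivalent ways of doubling a value. */
static uint32_t twice_mul(uint32_t x)   { return x * 2u; } /* multiplier */
static uint32_t twice_add(uint32_t x)   { return x + x; }  /* adder */
static uint32_t twice_shift(uint32_t x) { return x << 1; } /* shifter */

typedef struct {
    const char *unit;
    uint32_t (*apply)(uint32_t);
    unsigned energy; /* made-up relative units, NOT measured */
    unsigned cycles; /* made-up latency, NOT measured */
} op_choice;

static const op_choice twice_ops[] = {
    { "multiplier", twice_mul,   9, 3 },
    { "adder",      twice_add,   3, 1 },
    { "shifter",    twice_shift, 1, 2 },
};

/* Lowest-energy option that fits the cycle budget, or NULL if none fits. */
static const op_choice *pick_lowest_energy(unsigned max_cycles) {
    const op_choice *best = NULL;
    for (size_t i = 0; i < sizeof twice_ops / sizeof twice_ops[0]; i++) {
        const op_choice *op = &twice_ops[i];
        if (op->cycles > max_cycles) {
            continue;
        }
        if (best == NULL || op->energy < best->energy) {
            best = op;
        }
    }
    return best;
}
```

With these invented numbers, a budget of one cycle picks the adder while a looser budget picks the shifter, which is the kind of trade-off a power flag would have to weigh against speed.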

Therefore, there is room for innovation and further research in this area.

New Multi-Threading Model

In order to streamline the new AEMB processor family, a new multi-threading model is being tested. In the new multi-threading model, each core will be capable of running at least four threads, switched either manually or automatically. For the AEMB1, the threads would need to be switched explicitly using special software break instructions. For the AEMB4, the context switching will happen automagically. The AEMB2 will be somewhere in between the two.

As for the software, the AEMB1 will be focused on single-threaded applications. Therefore, it would only run a single thread by default unless the switching is done explicitly. Once switched, it will continue to run only the new single-threaded code. It can do this hardware context switching for at least four threads. T0 will still be reserved for interrupt execution. So, a typical OS kernel would start on T0 and run the C runtime code before quickly switching over to T1 for the actual kernel code. It can then switch over to T2 or T3 for application code. When an interrupt happens, the core switches automatically back to T0 to execute the interrupt service routine before switching back to another thread.
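The AEMB1 behaviour described above can be modelled as a tiny state machine. This is a sketch under stated assumptions, not the actual core: the `aemb1_model` type and function names are made up, and it assumes that the core returns to the interrupted thread after the service routine and that explicit switches are ignored while an interrupt is being serviced.

```c
#include <stdbool.h>

enum { T0 = 0, NUM_THREADS = 4 };

typedef struct {
    unsigned current;  /* thread currently executing */
    unsigned resume;   /* thread to go back to after an interrupt */
    bool in_interrupt; /* true while T0 runs the service routine */
} aemb1_model;

/* Explicit switch, standing in for the software break instruction. */
static void switch_thread(aemb1_model *c, unsigned t) {
    if (t < NUM_THREADS && !c->in_interrupt) {
        c->current = t;
    }
}

/* An interrupt always lands on T0, whatever was running. */
static void interrupt_enter(aemb1_model *c) {
    if (!c->in_interrupt) {
        c->resume = c->current;
        c->current = T0;
        c->in_interrupt = true;
    }
}

/* Assumption: the service routine hands control back to the interrupted thread. */
static void interrupt_exit(aemb1_model *c) {
    if (c->in_interrupt) {
        c->current = c->resume;
        c->in_interrupt = false;
    }
}
```

A boot sequence in this model starts with `current` at T0, calls `switch_thread(&core, 1)` for the kernel and later `switch_thread(&core, 2)` for an application, with interrupts bouncing through T0 in between.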

At the other extreme, the AEMB4 will be focused on multi-threaded applications. It will run four threads in hardware by default using interleaved multi-threading (IMT) techniques. Switching is done automatically for true multi-threaded operation. A typical OS kernel would start on T0 and run the C runtime code before activating T1-T3 for the application code. Each thread might be used to run different servers in a micro-kernel style OS. All interrupts will only be serviced on T0 without blocking the operations of the other threads.
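Interleaving of this kind can be pictured with a toy model in which each clock cycle belongs to one thread in strict rotation. This is a sketch only. It assumes a fixed round-robin order and one 4-byte instruction issued per slot, and `imt_core`, `imt_thread_for_cycle` and `imt_run` are hypothetical names.

```c
#include <stdint.h>

#define IMT_THREADS 4u

typedef struct {
    uint32_t pc[IMT_THREADS]; /* one program counter per hardware thread */
} imt_core;

/* Assumption: strict round-robin, so cycle n belongs to thread n mod 4. */
static unsigned imt_thread_for_cycle(unsigned long cycle) {
    return (unsigned)(cycle % IMT_THREADS);
}

/* Advance the model: each cycle issues one 4-byte instruction
   from whichever thread owns that slot. */
static void imt_run(imt_core *c, unsigned long cycles) {
    for (unsigned long n = 0; n < cycles; n++) {
        c->pc[imt_thread_for_cycle(n)] += 4u;
    }
}
```

In this model every thread gets a quarter of the issue slots no matter what the others are doing, which is why interrupt servicing on T0 would not hold up T1-T3.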

This new threading model would hopefully find more real-world use by providing a versatile hardware platform for modern multi-threaded applications.

In-Cache Execution Environment

The AEMB is designed with FPGAs as the target implementation technology. Since this is the case, it may be prudent to exploit certain FPGA capabilities that are not present on ASIC technologies. One such capability is the ability of an FPGA to pre-load the contents of block memories from an FPGA image. This ability is often used to create a read-only RAM block or ROM block.

However, if the write-enable signal is enabled, this ability can be used to pre-load the contents of any RAM block, such as the one used to hold the instruction cache of the AEMB. It would be interesting to inject a set of start-up instructions into the instruction cache during power-on. While the instruction cache is necessarily small, it should be large enough to hold some sort of initial execution environment.

There are many possible applications for the initial execution environment. One such possibility is to emulate certain hardware functionality in software, such as memory built-in self-test and internal register self-configuration. Another possibility would be to store a first-stage boot-loader that is used to load and run the actual application boot-loader.
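As an idea of how small a software memory self-test can be, here is a minimal pattern test of the sort that might fit in a pre-loaded cache. It is a sketch, not a full march test: the function name `mem_selftest` and the two patterns are assumptions, and the test destroys whatever the memory held.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Write each pattern to every word, then read it all back.
   Returns true only if every word held both patterns. */
static bool mem_selftest(volatile uint32_t *base, size_t words) {
    static const uint32_t patterns[] = { 0xAAAAAAAAu, 0x55555555u };
    for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
        for (size_t i = 0; i < words; i++) {
            base[i] = patterns[p];
        }
        for (size_t i = 0; i < words; i++) {
            if (base[i] != patterns[p]) {
                return false; /* stuck or shorted bit somewhere */
            }
        }
    }
    return true;
}
```

The two alternating patterns drive every bit to both 0 and 1, which catches simple stuck-at faults but not addressing faults. A fuller test would need more passes and so more of the limited cache space.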

This will probably become a feature in the EDK63 version of the AEMB.
