The big picture and saving time

The project is getting closer to completion, with the top level creator being as done as possible. In this post I’ll start by explaining the top level creator and its awkward status of being “as done as possible”. Next, I’ll try to make the big picture of our project clearer by explaining the whole process step by step. Most importantly, I’ll present my results from measuring the time consumed by the Xilinx tools under various options.

The top level creator is supposed to analyze the ELF file containing the user’s code, extract the devices chosen by the user along with their addresses and configuration, and finally create the top level of our system based on this data. However, we can’t yet generate the proper ELF file, so for now the top level creator is fed the design manually; that is what “as done as possible” means. The top level creator starts by instantiating the processor, SOC and memory. Then it instantiates the modules for the IO devices and accelerators chosen by the user, with their correct configuration and addresses. It terminates any unused parts of the SOC switch and finally runs Emacs to process the file with Verilog-Mode.
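To make the instantiation step more concrete, here is a minimal Python sketch of how a device list could be turned into a top level file that Verilog-Mode then expands. It is not our actual creator: the `Device` fields, the module names and the `BASE_ADDR` parameter are hypothetical, and the device list is hard-coded instead of being read from an ELF file. The `/*AUTOINST*/` and `/*AUTOARG*/` markers and the `verilog-batch-auto` function belong to Verilog-Mode; the Emacs command is only built here, not run.

```python
from dataclasses import dataclass


@dataclass
class Device:
    module: str      # Verilog module name (hypothetical)
    instance: str    # instance name in the top level (hypothetical)
    base_addr: int   # address chosen by the user


def render_instance(dev):
    # Ports are left to Verilog-Mode, which expands /*AUTOINST*/.
    return (
        f"  {dev.module} #(.BASE_ADDR(32'h{dev.base_addr:08X})) {dev.instance} (\n"
        f"    /*AUTOINST*/\n"
        f"  );\n"
    )


def render_top(devices, name="top"):
    # One instance per selected device, wrapped in a single module.
    body = "".join(render_instance(d) for d in devices)
    return f"module {name} (/*AUTOARG*/);\n{body}endmodule\n"


def verilog_mode_command(path):
    # Batch invocation of Verilog-Mode; depending on the Emacs setup,
    # verilog-mode may have to be loaded explicitly first.
    return ["emacs", "--batch", path, "-f", "verilog-batch-auto"]


if __name__ == "__main__":
    devices = [
        Device("gpio", "gpio0", 0x40000000),
        Device("spi", "spi0", 0x40001000),
    ]
    print(render_top(devices))
```

Leaving the port lists to Verilog-Mode keeps the generator small, since it only has to decide which modules exist and where they are mapped.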

The whole process is supposed to go like this: the user enters their code in our web-based IDE, where it is compiled to generate an ELF file. The top level creator analyzes this ELF file and creates the top level for our system with all the devices and configurations chosen by the user. The design is then run through the Xilinx tools in command line mode. Next, data2mem adds the software instructions to the bit stream, which is then modified to suit the PIC and the transfer via Ethernet. Finally, the PIC uses this bit stream to program the FPGA at the user’s end.
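The command line part of this flow can be sketched as a list of tool invocations. This is an illustration under assumptions, not our build script: the file names, the part string and the exact set of flags are placeholders written from memory of the ISE command line tools, so check them against the documentation for your ISE version. The function only builds the commands; it does not run them.

```python
def build_flow(design, part, bmm, elf, map_effort="high", par_effort="high"):
    """Return the tool invocations for one build, in execution order.

    All file names are derived from `design` and are assumptions.
    The XST effort level lives in the .xst script, not on the command line.
    """
    return [
        ["xst", "-ifn", f"{design}.xst", "-ofn", f"{design}.syr"],
        ["ngdbuild", "-p", part, f"{design}.ngc", f"{design}.ngd"],
        ["map", "-ol", map_effort, "-w", "-o", f"{design}_map.ncd",
         f"{design}.ngd", f"{design}.pcf"],
        ["par", "-ol", par_effort, "-w", f"{design}_map.ncd",
         f"{design}.ncd", f"{design}.pcf"],
        ["bitgen", "-w", f"{design}.ncd", f"{design}.bit", f"{design}.pcf"],
        # data2mem merges the ELF contents into the block RAMs of the bit stream.
        ["data2mem", "-bm", bmm, "-bd", elf, "-bt", f"{design}.bit",
         "-o", "b", f"{design}_sw.bit"],
    ]


if __name__ == "__main__":
    for cmd in build_flow("top", "PART", "top_bd.bmm", "app.elf"):
        print(" ".join(cmd))
```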

The time consumed by the whole process is crucial to delivering a smooth user experience. All processes take less than a second to execute except for the Synthesis process. I’ve measured the time for this process with an SOC, a GPIO, 2 SPIs and a varying number of SHA1 devices, up to 8. The process takes about 3 minutes to complete without any SHA1 device and about six and a half minutes with 8 SHA1 devices.
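Measurements like these can be collected with a small timing wrapper around each tool invocation. The sketch below is an assumption about how one might do it, not the script behind the numbers above; the step names and the `run` parameter (which defaults to `subprocess.run` and can be swapped for a stub) are my own choices.

```python
import subprocess
import time


def time_steps(steps, run=subprocess.run):
    """Run (name, command) pairs in order and return seconds spent per step."""
    timings = {}
    for name, cmd in steps:
        start = time.perf_counter()
        run(cmd, check=True)  # stop the flow if a tool fails
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Stub runner so the example works without the Xilinx tools installed.
    result = time_steps(
        [("xst", ["xst"]), ("map", ["map"]), ("par", ["par"])],
        run=lambda cmd, check: None,
    )
    for name, seconds in result.items():
        print(f"{name}: {seconds:.3f} s")
```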

The following graph shows the time consumed by each process in Synthesis. MAP consumes the most time, followed by PAR and then the synthesis step itself (XST). All processes consume more time as the SOC grows bigger.

Picture 1

Xilinx has one option that can affect the speed of this process: the optimization effort level. For XST the option has three levels, 0, 1 and 2, with 1 being the default. As the level increases, more optimization is attempted and hence more time is consumed. For MAP and PAR the option has only two levels, std and high, with high being the default. The trade-off with less optimization is a larger design, i.e. more slices used, especially LUTs. I’ve conducted tests with variations of these options to find the best compromise between speed and area. I ran the flow with all processes at the default optimization level, with all processes at the fastest level, and then with two processes at the fastest setting while the third one stays at the default.
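The five setups described above can be enumerated mechanically. The following sketch takes the effort levels from the text (XST 0 or 1, MAP and PAR std or high); the dictionary layout and the setup labels are my own naming.

```python
# Effort levels as described in the text above.
DEFAULT = {"xst": "1", "map": "high", "par": "high"}
FASTEST = {"xst": "0", "map": "std", "par": "std"}


def setups():
    """Return (label, settings) pairs for the five tested setups."""
    result = [
        ("all default", dict(DEFAULT)),
        ("all fastest", dict(FASTEST)),
    ]
    # Two tools at the fastest level, the third one left at its default.
    for tool in DEFAULT:
        cfg = dict(FASTEST)
        cfg[tool] = DEFAULT[tool]
        result.append((f"{tool} default, others fastest", cfg))
    return result


if __name__ == "__main__":
    for label, cfg in setups():
        print(label, cfg)
```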

The following graph shows the total execution time for each setup. The worst cases occur when MAP is set to the high effort level. Setting XST to effort level 0 doesn’t save a significant amount of time. The most time that can be saved is about one minute.

Picture 2

This graph shows the difference in the number of LUTs for each setup with respect to the default setup. The fastest setup, with all tools (including MAP) set to the lowest possible effort level, shows a significant increase in the number of slice LUTs used. The number of extra LUTs increases rapidly as the SOC grows bigger. The setup closest to the default one is the one in which MAP and PAR both use the std optimization level while XST uses its default level.

Picture 3

It is worth mentioning that the number of used Slice Registers doesn’t change significantly across the various setups. Moreover, as can be observed, the optimization effort becomes more important as the number of SHA1 devices increases. The system without any SHA1 device consumes almost the same area at every optimization effort level.

So far, then, the optimum setup is to set both MAP and PAR to the low (std) optimization level, which saves a lot of time, while keeping XST at its default optimization level to save on used LUTs.
