This week the real comparison starts: finding out how much faster the demosaic core is than a software implementation.
Software Implementation of Demosaic
Previously, I had finished software implementations of several demosaic algorithms and tested them in Darktable. Darktable is very convenient because it lets me zoom in to individual pixels to view the differences or to debug an algorithm. Initially, I ported the C code I had written directly to the Xilinx SDK. Once I ran it on the Zedboard and started the cycle counter, I found that my code actually sucks! It takes far too many clock cycles to execute, and it is not practical as a real software implementation.
The first problem I encountered during the software implementation was insufficient memory for the quarter frame. Assuming each pixel occupies a single byte, a 960×540 frame takes 518.4 kB, while the Zedboard RAM is only 512 MB. Besides, I would need 4 × 518.4 kB to hold the raw, red, blue, and green pixels. So I split the frame into groups of lines and loop over the algorithm several times until the full 960×540 resolution is covered. For example, 2 lines at the beginning and 2 lines at the end of the frame are handled with bilinear interpolation, which leaves 536 lines for HQLI interpolation. If the HQLI algorithm interpolates only 4 lines per pass, it needs to loop 536/4 = 134 times to cover the 960×540 frame. The total number of raw lines stored at once is therefore only 8 (4 for bilinear interpolation and 4 for HQLI), which takes just 8 × 960 = 7,680 bytes of memory. Storing the raw and RGB pixels together takes 4 × 7,680 = 30,720 bytes.
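The buffer arithmetic above can be written down as a small C sketch. This is not the code that ran on the Zedboard. The constant and function names are made up for illustration, and it only reproduces the numbers quoted in the paragraph.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the values are the ones quoted in the text. */
enum {
    FRAME_WIDTH  = 960, /* quarter-frame width in pixels (1 byte each) */
    FRAME_HEIGHT = 540, /* quarter-frame height in lines */
    BORDER_LINES = 2,   /* bilinear lines at the top, and again at the bottom */
    STRIP_LINES  = 4,   /* lines interpolated per HQLI pass */
    PLANES       = 4    /* raw, red, green, blue */
};

/* Bytes needed to hold one full plane of the quarter frame. */
static size_t full_frame_bytes(void)
{
    return (size_t)FRAME_WIDTH * FRAME_HEIGHT;
}

/* Number of HQLI passes needed for the lines between the two borders. */
static int hqli_passes(void)
{
    int inner_lines = FRAME_HEIGHT - 2 * BORDER_LINES;
    assert(inner_lines % STRIP_LINES == 0); /* strips must tile the frame */
    return inner_lines / STRIP_LINES;
}

/* Bytes for the raw lines kept in memory at once (bilinear + HQLI). */
static size_t raw_buffer_bytes(void)
{
    return (size_t)(2 * BORDER_LINES + STRIP_LINES) * FRAME_WIDTH;
}

/* Bytes for all four planes of the line buffers. */
static size_t total_buffer_bytes(void)
{
    return (size_t)PLANES * raw_buffer_bytes();
}
```

With these values, `hqli_passes()` gives 134 and `total_buffer_bytes()` gives 30,720, against 518,400 bytes for a single full plane.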
Efficient Demosaic C Implementation
After setting up the memory, I optimized the C implementation of the demosaic algorithm. Having implemented it in hardware first actually helped me write efficient C code. After many tries, I also found that, in my tests, subtraction consumed more cycles than addition. If four variables are to be subtracted from a value, it is more efficient to sum the four variables first and subtract the sum in one go. Writing efficient code also taught me a lot about conditional loops: code should be structured into as few loops as possible. Each loop consumes some cycles, and these add up to a very large cycle count over the whole algorithm. For instance, if an additional conditional loop adds 100 cycles and the algorithm repeats 960×134 times, as in my case, that adds up to 12,864,000 more cycles. Amazing, right? The numbers actually frighten me!
In the end, I re-coded most of what I had written for the demosaicing, and the result is 4 times smaller than the code I started with. All of this was tested on the Zedboard, which is a really good learning tool for both hardware and software.
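As a minimal sketch of the two points above, the snippet below shows the grouped form of the subtraction and the overhead arithmetic. The function names are hypothetical. Whether the grouped form is actually faster depends on the compiler and the CPU, so it is worth measuring on the target.

```c
#include <assert.h>
#include <stdint.h>

/* Direct form: four separate subtractions. */
static int32_t subtract_each(int32_t x, int32_t a, int32_t b,
                             int32_t c, int32_t d)
{
    return x - a - b - c - d;
}

/* Grouped form: sum the four terms first, then subtract once. */
static int32_t subtract_grouped(int32_t x, int32_t a, int32_t b,
                                int32_t c, int32_t d)
{
    return x - (a + b + c + d);
}

/* Extra cycles caused by a fixed per-iteration overhead. */
static uint64_t extra_cycles(uint32_t width, uint32_t passes,
                             uint32_t cycles_per_iteration)
{
    return (uint64_t)width * passes * cycles_per_iteration;
}

/* Both forms agree for pixel-sized values, where nothing overflows. */
static void self_check(void)
{
    assert(subtract_each(1000, 10, 20, 30, 40) ==
           subtract_grouped(1000, 10, 20, 30, 40));
    assert(extra_cycles(960, 134, 100) == 12864000u);
}
```

Calling `extra_cycles(960, 134, 100)` reproduces the 12,864,000 figure from the text.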
Neon Software Implementation
After implementing the demosaicing algorithm in C and testing it on the Zedboard, I tried to implement the algorithm using the Neon unit inside the Zynq. I found Xilinx documentation about the usage and characteristics of Neon here. At first, I really thought it was as simple as enabling a compiler flag and some optimization flags. But after running objdump on the resulting ELF file, I found no difference at all. According to the example in the Xilinx link, automatic vectorization only reduces the cycle count by 2.5 times, which is quite low as well. Later, I found a source on the Internet that recommends Neon intrinsics, which are C functions that map to assembly instructions. That source ran an experiment with Neon: intrinsics did better at reducing the cycle count, and hand-written Neon assembly reduced it even more, by up to 7.5 times.
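To give an idea of what Neon intrinsics look like, here is a minimal sketch that averages two rows of 8-bit pixels, 16 pixels at a time. It is not the demosaic code from this project. The function `average_rows` is a made-up example, chosen because averaging neighbours is the basic operation in bilinear interpolation. When Neon is not available, the code falls back to a plain C loop that computes the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Rounded average of two pixel rows: dst[i] = (a[i] + b[i] + 1) / 2. */
static void average_rows(const uint8_t *a, const uint8_t *b,
                         uint8_t *dst, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && dst != NULL);

#if HAVE_NEON
    /* Process 16 pixels per iteration with 128-bit Neon registers. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);        /* load 16 bytes from a */
        uint8x16_t vb = vld1q_u8(b + i);        /* load 16 bytes from b */
        vst1q_u8(dst + i, vrhaddq_u8(va, vb));  /* rounding halving add */
    }
#endif

    /* Scalar tail, and the whole row when Neon is not available. */
    for (; i < n; ++i) {
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }
}
```

With GCC for a 32-bit ARM target such as the Cortex-A9 in the Zynq, the Neon path typically needs `-mfpu=neon` so that `__ARM_NEON__` is defined.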
Comparison of Software and Hardware Implementation of Demosaic Core
After testing and experimenting with the demosaic core in both hardware and software, the result is that the hardware implementation is 16 times faster than the software implementation of demosaicing on the Zynq. With Neon enabled, the demosaic core would still be at most 8 times faster than the software implementation, which also shows that demosaicing is accelerated when it is implemented in hardware. The figure below shows the comparison of implementation vs the number of cycles taken.
Zynq Core Implementation Frequency
One thing I find very annoying about the Xilinx software: when I tested the implementation speed of the demosaic core, it only reached 95 MHz. That is very low compared to the synthesis frequency of 250 MHz, not even 50% of it. Instead of debugging the project I had made in Xilinx PlanAhead, I created a fresh project and tested the core without using the PS of the Zynq. Surprisingly, the core reached 191 MHz after implementation, which is around 75% of the synthesis frequency. This may be because the GPIO instantiated on the PS slowed down the implementation of the core.