Pipelined Design for Demosaic Core
This week I worked on the circuit design for the demosaic IP core. Last week, my supervisor and I decided to use the HQLI algorithm, as it is more hardware-friendly and its results are satisfactory. The demosaic core is divided into two major parts: the buffer and the calculator. Following my supervisor's earlier advice, and to avoid heading in the wrong direction again, I started working out the design from the output of the core.
The Adder Cascade and Adder Tree Pipeline
While reading the documentation of the Xilinx DSP48A1 slice, I found that it gives a very good example for understanding pipelining in hardware design. A typical design for a complex computation is an adder tree, in which multiple inputs are fed through pipelined levels of adders to produce the final output. The DSP slice documentation, however, proposes an adder cascade, which adds one more input to the running sum at each stage, to improve the power efficiency of the design. This allows the circuit to run at high clock frequencies, because a signal passes through at most one adder per cycle. The drawback of the adder cascade is that it uses more adders and registers than the adder tree.
Besides, using the DSP slices saves logic resources in the FPGA, so the circuit is smaller than one built with an adder tree in the fabric.
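To make the difference between the two structures concrete, here is a small behavioural model in Python. It is only a sketch, not the DSP48A1 itself: it assumes all inputs arrive in the same cycle and that every adder output is registered. The function names and the register count for delaying the cascade inputs are my own illustration.

```python
def adder_tree(values):
    """Pipelined adder tree: each level adds pairs and registers the sums.

    Returns (sum, latency in cycles).
    """
    level = list(values)
    cycles = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd input is registered and passed on
        level = nxt
        cycles += 1
    return level[0], cycles


def adder_cascade(values):
    """Adder cascade: each stage adds one more input to the running sum.

    Returns (sum, latency in cycles, input delay registers). The input of
    stage k must be delayed k-1 cycles to line up with the running sum
    (assumption: all inputs are presented in the same cycle).
    """
    values = list(values)
    total = values[0]
    cycles = 0
    delay_regs = 0
    for stage, v in enumerate(values[1:], start=1):
        total += v
        cycles += 1
        delay_regs += stage - 1
    return total, cycles, delay_regs
```

For eight inputs, the tree finishes in 3 cycles, while the cascade needs 7 cycles and 21 extra delay registers in this model. Both have only one adder between any two registers.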
The DSP block also offers a multiplier and a pre-adder, which makes it well suited to implementing complex computation algorithms. As I went through the circuit further, I noticed that it could be reduced even more by replacing the multiplier with a wire shift, that is, wiring the result directly to shifted output bits whenever the multiplier is 2. For example, suppose the circuit needs to add two inputs and multiply the sum by the constant 2. This can be done with a single adder: the n-bit sum is wired to bits 1 to n of the output, and bit 0 of the output is tied to zero. The wire shift performs the multiplication directly, without spending another cycle on a multiplier, which saves logic resources and improves the efficiency of the circuit.
The same technique works for division by 2: the output is taken from bit 1 upwards of the sum, which drops the least significant bit. The limitation of the wire shift is that the multiplier or divisor must be a known constant and a power of 2.
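The wire shift can be checked with a few lines of Python, where a shift stands in for the rewiring. This is a sketch under my own assumptions: the inputs are unsigned, the default width is 8 bits, and the function names are made up for illustration.

```python
def add_then_double(a, b, width=8):
    """(a + b) * 2 with the doubling done by wiring, not by a multiplier."""
    s = (a + b) & ((1 << (width + 1)) - 1)  # adder output is width+1 bits
    return s << 1  # sum bits move up one position, bit 0 is tied to 0


def add_then_halve(a, b, width=8):
    """(a + b) / 2 by dropping the least significant bit (truncates)."""
    s = (a + b) & ((1 << (width + 1)) - 1)
    return s >> 1  # output is taken from bit 1 upwards of the sum
```

With a = 100 and b = 57 the sum is 157, so doubling gives 314 and halving gives 78, since the dropped bit truncates 78.5.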
Image Frame Quad Output Edge Interpolation
As discussed previously, extra circuitry would be needed to handle the edges of each quadrant output. This would significantly increase the circuit size and reduce efficiency, since some of that circuitry would be used only once per image frame. One method to process the image as if it had never been split into quad outputs is for each quadrant to store pixels from its neighboring quadrant in its own buffer, once the current line has been fully buffered. For example, for an image quadrant of 960 pixels per line, the BRAM buffer would hold up to 960+x pixels, where x is the number of pixels taken from the neighboring quadrant. Interpolation would still produce output only for the 1st to the 960th pixel, using the x extra pixels purely as input data. This removes the need for bilinear interpolation at the quadrant edges, which might otherwise produce visibly different output in the middle of the image, because the whole image is then interpolated with a single algorithm.
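The idea of borrowing x pixels from the neighboring quadrant can be tested on a single line in Python. This is only a sketch: a simple horizontal average stands in for the HQLI kernel, the line is split into a left and a right quadrant, and the names and the clamping at the true image border are my own assumptions.

```python
def filter_span(buf, start, stop, radius):
    """Filter buf[start:stop]; taps that fall outside buf are clamped."""
    last = len(buf) - 1
    out = []
    for i in range(start, stop):
        taps = [buf[min(max(i + k, 0), last)]
                for k in range(-radius, radius + 1)]
        out.append(sum(taps) // len(taps))
    return out


def filter_in_quadrants(line, width, x):
    """Process a line as two quadrants, each buffering x neighbor pixels."""
    left_buf = line[:width + x]   # own 'width' pixels + x from the right
    right_buf = line[width - x:]  # x from the left + own pixels
    left = filter_span(left_buf, 0, width, x)
    right = filter_span(right_buf, x, len(right_buf), x)
    return left + right


# A 1920-pixel test line split into two 960-pixel quadrants with x = 2.
line = [(i * 37) % 256 for i in range(1920)]
whole = filter_span(line, 0, len(line), 2)
split = filter_in_quadrants(line, 960, 2)
```

Here `split` comes out identical to `whole`, so the seam between the quadrants leaves no trace as long as x covers the reach of the kernel.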