Friday, February 1, 2013

The world's most compact HQ2X in Verilog?

HQ2X is a pretty amazing algorithm that can be used to upsample images from the NES's pretty mediocre resolution of 256x240 into the double, 512x480. It translates one single pixel into a block of 2x2 pixels, by looking at the surrounding pixels and interpolating.

Here's an example of what it looks like before and after (from Super Mario Bros 2):

And here's an example from Super Mario Bros:

I really wanted to have support for HQ2X in my FPGA NES, so I had to write the algorithm in Verilog. I found some info about HQ2X on the nesdev forums, and it turns out there exists some symmetry in the HQ2X algorithm so that it can be represented pretty compactly in C++. Rewriting it in Verilog while still conserving FPGA resources was a fun but challenging problem!

My VGA core is still clocked at 21.4772Mhz. This means that I need to output one pixel every clock cycle.

I have 2 x 256 pixels of blockram for the input pixels from the PPU, i.e. the two most recent lines seen. I call those Prev and Curr.

To process pixel E, HQ2X considers all pixels surrounding E:
A B C       -- Previous line (populated from Block RAM Prev)
D E F       -- Current line  (populated from Block RAM Curr)
G H I       -- Next line     (populated from the NES PPU)
I treat this as a sliding window, so that for every new input pixel, I just shift everything one step to
the left, and fill C, F, I with the new inputs from PPU or Prev/Curr. Note: You need to apply some special treatment on the edges of the screen, so you never try to read pixels outside of the visible screen area.

HQ2X contains a function diff(), used to compare if two pixels are similar enough. It compares each surrounding pixel against E, and this results in 8 similarity values. These are all packed into one byte called 'pattern'.

Overall my verilog structure works as follows:
(Pipelined, so that when Clock 4 starts, Clock 0 will start processing the next set of input).

Clock 0: Grab the next pixel from Prev[x] into C, compute 2 bits of 'pattern'
Clock 1: Grab the next pixel from Curr[x] into F, compute 2 bits of 'pattern'
Clock 2: Read next pixel from PPU into I, write it to Prev[x], compute
2 bits of 'pattern'
Clock 3: Compute 2 bits of 'pattern'
Clock 4: Perform: a[0] = blend(hqTable[pattern], E, A, B, D, F, H); pattern = rotate[pattern];
Clock 5: Perform: a[1] = blend(hqTable[pattern], E, C, F, B, H, D); pattern = rotate[pattern];
Clock 6: Perform: b[1] = blend(hqTable[pattern], E, I, H, F, D, B); pattern = rotate[pattern];
Clock 7: Perform: b[0] = blend(hqTable[pattern], E, G, D, H, B, F); pattern = rotate[pattern];

I use a 4-line buffer for the output, so while HQ2X writes to row a,b the vga module reads from row c,d and vice versa.

I rearranged the blend() function into a that a simpler form that could represent the whole function on the form:
Result = (Input1 * Mul1 * 2 + Input2 * Mul2 + Input3 * Mul3) >> 4
Where Mul1 is a 3-bit value, and Mul2 and Mul3 are 2-bit values. This means I needed only two 2x5 bit multipliers and one 3x5 bit multiplier. Those were cheap enough to implement on LUTs, so I didn't even need to use FPGA DSP units.

I also got rid of much of the RGB->YUV mess by instead operating on integers, as follows:
function same(pixel a, pixel b):
  r = a.r - b.r
  g = a.g - b.g
  b = a.b - b.b
  y = r + g + b
  u = r - b
  v = 2 * g - r - b
  return v in -24..23 and u in -4..3 and v in -6..5
which behaves very similar to the HQ2X linked above. (Note: That implementation also differs from the original HQ2X due to the optimizations, so exact accuracy here is not really a must).

The pipelining means that every single resource is used during every clock cycle. The only time they are idling is during the horizontal and vertical blanking periods when no pixels need to be outputted.

All in all, this gave me a pretty compact Verilog implementation of HQ2X, on my Spartan-6, the resource utilization is:
Number of Slice Registers:              256 out of  18,224    1%
Number of Slice LUTs:                   461 out of   9,112    5%
Number of occupied Slices:              163 out of   2,278    7%
Number of LUT Flip Flop pairs used:     509
Number of DSP48A1s:                       0 out of      32    0%
Number of RAMB16BWERs:                    2 out of      32    6%
Number of RAMB8BWERs:                     2 out of      64    3%

And the result on the NES looks fantastic!