I'm probably not going to work any more on the project, so rather than abandoning it completely, I decided to put the source code up on GitHub if anyone else wants to learn or get inspired.
https://github.com/strigeus/fpganes
Ludde's FPGA NES
Friday, January 3, 2014
Friday, February 1, 2013
The world's most compact HQ2X in Verilog?
HQ2X is a pretty amazing algorithm that can be used to upsample images from the NES's fairly low resolution of 256x240 to double that, 512x480. It translates each pixel into a block of 2x2 pixels by looking at the surrounding pixels and interpolating.
Here's an example of what it looks like before and after (from Super Mario Bros 2):
And here's an example from Super Mario Bros:
I really wanted to have support for HQ2X in my FPGA NES, so I had to write the algorithm in Verilog. I found some info about HQ2X on the nesdev forums, and it turns out there exists some symmetry in the HQ2X algorithm so that it can be represented pretty compactly in C++. Rewriting it in Verilog while still conserving FPGA resources was a fun but challenging problem!
My VGA core is still clocked at 21.4773 MHz. Each PPU pixel lasts 4 of those clock cycles and expands into 4 output pixels, which means that I need to output one pixel every clock cycle.
I have 2 x 256 pixels of blockram for the input pixels from the PPU, i.e. the two most recent lines seen. I call those Prev and Curr.
To process pixel E, HQ2X considers all pixels surrounding E:
A B C -- Previous line (populated from Block RAM Prev)
D E F -- Current line (populated from Block RAM Curr)
G H I -- Next line (populated from the NES PPU)
I treat this as a sliding window, so that for every new input pixel, I just shift everything one step to the left, and fill C, F, I with the new inputs from PPU or Prev/Curr. Note: You need to apply some special treatment on the edges of the screen, so you never try to read pixels outside of the visible screen area.
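As a rough software model of that sliding window (a Python sketch, not the actual Verilog), the generator below walks one line and yields the 3x3 neighbourhood for each pixel. The function name and the edge policy (clamping to the nearest valid column) are assumptions made for illustration; the post does not say which edge treatment is used.

```python
def windows_for_line(prev, curr, nxt):
    """Yield (x, window) for each pixel of `curr`.

    window is [[A, B, C], [D, E, F], [G, H, I]].
    Columns outside the line are clamped to the nearest valid column
    (an assumed edge policy, not necessarily what the hardware does).
    """
    width = len(curr)
    rows = (prev, curr, nxt)

    def column(x):
        x = min(max(x, 0), width - 1)  # clamp at the screen edges
        return [row[x] for row in rows]

    # Prime the window so that E becomes pixel 0 after the first shift.
    left, mid, right = column(-1), column(-1), column(0)
    for x in range(width):
        # Shift one step to the left and fill C, F, I with the new column.
        left, mid, right = mid, right, column(x + 1)
        yield x, [[left[r], mid[r], right[r]] for r in range(3)]
```

In hardware the three rows would come from the Prev block RAM, the Curr block RAM and the PPU; here they are just three Python lists of equal length.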
HQ2X contains a function diff(), used to compare if two pixels are similar enough. It compares each surrounding pixel against E, and this results in 8 similarity values. These are all packed into one byte called 'pattern'.
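A minimal sketch of how the eight comparisons could be packed into the 'pattern' byte. The bit order (A, B, C, D, F, G, H, I as bits 0..7) and the convention that a set bit means "differs from E" are assumptions for illustration; `similar` stands in for whatever comparison function is used.

```python
def build_pattern(window, similar):
    """Pack the 8 neighbour-vs-E comparisons into one byte.

    Assumed bit order: A, B, C, D, F, G, H, I -> bits 0..7.
    A bit is set when the neighbour is NOT similar to E (assumed polarity).
    """
    (a, b, c), (d, e, f), (g, h, i) = window
    pattern = 0
    for bit, neighbour in enumerate((a, b, c, d, f, g, h, i)):
        if not similar(neighbour, e):
            pattern |= 1 << bit
    return pattern
```

For example, a window where only C and I differ from E sets bits 2 and 7.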
Overall my Verilog structure works as follows:
(Pipelined, so that when Clock 4 starts, Clock 0 will start processing the next set of input).
Clock 0: Grab the next pixel from Prev[x] into C, compute 2 bits of 'pattern'
Clock 1: Grab the next pixel from Curr[x] into F, compute 2 bits of 'pattern'
Clock 2: Read next pixel from PPU into I, write it to Prev[x], compute 2 bits of 'pattern'
Clock 3: Compute 2 bits of 'pattern'
Clock 4: Perform: a[0] = blend(hqTable[pattern], E, A, B, D, F, H); pattern = rotate[pattern];
Clock 5: Perform: a[1] = blend(hqTable[pattern], E, C, F, B, H, D); pattern = rotate[pattern];
Clock 6: Perform: b[1] = blend(hqTable[pattern], E, I, H, F, D, B); pattern = rotate[pattern];
Clock 7: Perform: b[0] = blend(hqTable[pattern], E, G, D, H, B, F); pattern = rotate[pattern];
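The rotate[] lookup in clocks 4 to 7 can be modelled as a quarter-turn of the neighbourhood, so the same hqTable entry layout serves all four output pixels. The sketch below derives the permutation from the argument order of the four blend() calls above (A is replaced by C, B by F, D by B, and so on). The bit order A, B, C, D, F, G, H, I = bits 0..7 is an assumption, so the actual table contents in the Verilog may differ.

```python
# Assumed bit positions: A, B, C, D, F, G, H, I = 0..7.
# SOURCE_BIT[new] is the old bit that moves into position `new`
# after one quarter turn (new A = old C, new B = old F, ...).
SOURCE_BIT = (2, 4, 7, 1, 6, 0, 3, 5)

def rotate_pattern(pattern):
    """Rotate an 8-bit neighbourhood pattern by one quarter turn."""
    rotated = 0
    for new_bit, old_bit in enumerate(SOURCE_BIT):
        if pattern & (1 << old_bit):
            rotated |= 1 << new_bit
    return rotated

# 256-entry lookup table, analogous to rotate[pattern] in the text.
ROTATE = [rotate_pattern(p) for p in range(256)]
```

Applying the table four times brings every pattern back to where it started, which is why the pipeline can reuse one table for all four corners.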
I use a 4-line buffer for the output, so while HQ2X writes to row a,b the vga module reads from row c,d and vice versa.
I rearranged the blend() function into a simpler form, so that the whole function can be expressed as:
Result = (Input1 * Mul1 * 2 + Input2 * Mul2 + Input3 * Mul3) >> 4
Where Mul1 is a 3-bit value, and Mul2 and Mul3 are 2-bit values. This means I needed only two 2x5 bit multipliers and one 3x5 bit multiplier. Those were cheap enough to implement on LUTs, so I didn't even need to use FPGA DSP units.
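A small Python model of that formula, applied per colour channel. The 5-bit channel width is inferred from the "x5 bit" multipliers, and the weight triples used in the example are illustrative values that sum to 16; they are not taken from the actual hqTable.

```python
def blend_channel(in1, in2, in3, mul1, mul2, mul3):
    """(in1*mul1*2 + in2*mul2 + in3*mul3) >> 4 for one colour channel."""
    assert 0 <= mul1 < 8   # 3-bit multiplier
    assert 0 <= mul2 < 4   # 2-bit multiplier
    assert 0 <= mul3 < 4   # 2-bit multiplier
    return (in1 * mul1 * 2 + in2 * mul2 + in3 * mul3) >> 4

def blend_pixel(p1, p2, p3, weights):
    """Apply the same (mul1, mul2, mul3) weights to each (r, g, b) channel."""
    return tuple(blend_channel(c1, c2, c3, *weights)
                 for c1, c2, c3 in zip(p1, p2, p3))
```

With weights (7, 1, 1) the effective mix is 14:1:1 out of 16, and with (6, 2, 2) it is 12:2:2; when all three inputs are equal the output equals the input, since 2*Mul1 + Mul2 + Mul3 = 16 in both cases.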
I also got rid of much of the RGB->YUV mess by instead operating on integers, as follows:
function same(pixel a, pixel b):
    dr = a.r - b.r
    dg = a.g - b.g
    db = a.b - b.b
    y = dr + dg + db
    u = dr - db
    v = 2 * dg - dr - db
    return y in -24..23 and u in -4..3 and v in -6..5
which behaves very similarly to the HQ2X implementation linked above. (Note: That implementation also differs from the original HQ2X due to the optimizations, so exact accuracy here is not really a must).
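For experimenting on a PC, the comparison can be written as runnable Python. This is a sketch of the pseudocode above, with the -24..23 range applied to the luma-like sum y; the Pixel type and the 5-bit channel values in the usage note are assumptions for illustration.

```python
from collections import namedtuple

Pixel = namedtuple("Pixel", "r g b")  # hypothetical pixel type

def same(a, b):
    """Return True when pixels a and b are close enough in a YUV-like space."""
    dr = a.r - b.r
    dg = a.g - b.g
    db = a.b - b.b
    y = dr + dg + db          # brightness difference
    u = dr - db               # red/blue difference
    v = 2 * dg - dr - db      # green versus red/blue difference
    return -24 <= y <= 23 and -4 <= u <= 3 and -6 <= v <= 5
```

Note that the ranges are asymmetric (for example -4..3), which suits a hardware check on the upper bits of a two's complement value, so same(a, b) and same(b, a) can disagree right at the boundary.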
The pipelining means that every single resource is used during every clock cycle. The only time they idle is during the horizontal and vertical blanking periods, when no pixels need to be output.
All in all, this gave me a pretty compact Verilog implementation of HQ2X. On my Spartan-6, the resource utilization is:
Number of Slice Registers: 256 out of 18,224 1%
Number of Slice LUTs: 461 out of 9,112 5%
Number of occupied Slices: 163 out of 2,278 7%
Number of LUT Flip Flop pairs used: 509
Number of DSP48A1s: 0 out of 32 0%
Number of RAMB16BWERs: 2 out of 32 6%
Number of RAMB8BWERs: 2 out of 64 3%
And the result on the NES looks fantastic!
Friday, January 18, 2013
Ludde's FPGA NES
I was a bit bored during Christmas, so I decided to construct a whole Nintendo Entertainment System (NES) in an FPGA. An FPGA is a programmable integrated circuit, generally programmed using a hardware description language. The FPGA contains thousands of logic blocks, that can be connected together to form complex combinatorial logic, and flipflops are used to implement memory and feedback loops.
The Nintendo Entertainment System (NES) was created back in 1985 by the famous Japanese company Nintendo. It was an extremely revolutionary console for its time and was the best selling console for a number of years. I used to play NES a lot as a kid, so I have all those memories of the old games, and had an immense urge to dig into the inner details of the console to figure out how it worked.
I got myself a Digilent Nexys 3 FPGA development board. It's a ready made FPGA board with built in Flash, RAM, USB programming interface, and power supply circuitry. This is a really quick and easy path into the world of FPGAs. The alternative could be to make the whole PCB from scratch, and that is something I want to learn some day. A guy named Kevtris has done exactly that with his FPGA Game Console.
There are two competing hardware description languages, VHDL and Verilog. I chose to learn Verilog because VHDL seemed unnecessarily verbose while Verilog offered a much more compact syntax.
The NES contains a Ricoh 2A03 CPU, which is virtually identical to the MOS 6502 CPU but adds an on chip APU (Audio Processing Unit) and removes some CPU features such as Binary Coded Decimal arithmetic, supposedly to avoid paying patent royalties. Side by side with the CPU is a PPU (Picture Processing Unit), which is responsible for generating a 256x240 sized image. The console has extremely small amounts of memory by today's standards. It has 2kB of built in RAM for the CPU, and another 2kB used by the PPU. Additionally, some carts provided a few kilobytes of extra memory.
Since the CPU uses only a 16-bit address bus, it was common to include mapper chips inside of the cartridge. These are used to implement a mapping (or windowing) mechanism, so the CPU can select which part of the ROM to map into the address space. There exists a plethora of mapper configurations, around 200 or so. In some cases these were regular discrete logic chips such as the 74LS161, while in other cases they were custom made ASICs. In order to support all NES games, I need to implement all those mapper chips too, as the FPGA NES uses an iNES ROM image instead of a physical cartridge.
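To make the windowing idea concrete, here is a simplified Python model in the style of a UxROM board: a switchable 16 KiB bank at $8000-$BFFF and the last bank fixed at $C000-$FFFF. The class name and interface are hypothetical, and details of real hardware such as bus conflicts are ignored.

```python
BANK_SIZE = 16 * 1024  # 16 KiB PRG banks, as on UxROM-style boards

class SimpleBankMapper:
    """Hypothetical, simplified model of a UxROM-like mapper."""

    def __init__(self, prg_rom):
        assert len(prg_rom) % BANK_SIZE == 0
        self.prg_rom = prg_rom
        self.bank_count = len(prg_rom) // BANK_SIZE
        self.selected = 0

    def cpu_write(self, addr, value):
        # Any write into ROM space selects the switchable bank.
        if 0x8000 <= addr <= 0xFFFF:
            self.selected = value % self.bank_count

    def cpu_read(self, addr):
        if 0x8000 <= addr <= 0xBFFF:
            bank = self.selected             # switchable window
        elif 0xC000 <= addr <= 0xFFFF:
            bank = self.bank_count - 1       # fixed to the last bank
        else:
            raise ValueError("address not mapped to PRG ROM")
        return self.prg_rom[bank * BANK_SIZE + (addr & (BANK_SIZE - 1))]
```

The CPU still sees only 32 KiB of ROM at any time, but by writing a bank number it can reach a much larger ROM through the lower window.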
The NES PPU runs off of a 21.4773 MHz crystal. This frequency was chosen by the original NES designers because it is six times the NTSC color subcarrier frequency (about 3.58 MHz). These are the frequencies I make use of in my FPGA:
- 100MHz: The frequency of the devboard crystal. This gets converted down to 21.4773MHz by the FPGA's built in Digital Clock Manager (DCM) by first multiplying the frequency by 189, then dividing by 220, and finally dividing by 4.
- 21.4773MHz: The internal frequency the FPGA runs on. It is used to clock the PSRAM logic, the VGA scanline doubler, ROM loader and some other top level logic.
- 5.37MHz: Generated by dividing 21.4773 by 4. This is the frequency the PPU runs at, and every clock another pixel is output by the PPU. In my FPGA, I used clock enables to run the PPU only every 4th cycle, to avoid having multiple clock domains.
- 1.79MHz: PPU clock divided by 3. This clocks the CPU. Implemented as a clock enable that's active only every 12th cycle.
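The clock arithmetic in the list above can be checked with a few lines of Python. The variable names are only for illustration; the multiply and divide factors are the ones given in the list.

```python
from fractions import Fraction

board_clock_hz = Fraction(100_000_000)           # devboard crystal
master_hz = board_clock_hz * 189 / 220 / 4       # DCM: x189, /220, then /4
ppu_hz = master_hz / 4                           # clock enable every 4th cycle
cpu_hz = master_hz / 12                          # clock enable every 12th cycle

def mhz(freq_hz):
    """Frequency in MHz, rounded to 4 decimals."""
    return round(float(freq_hz) / 1e6, 4)

print(mhz(master_hz), mhz(ppu_hz), mhz(cpu_hz))
```

This gives roughly 21.4773, 5.3693 and 1.7898 MHz, matching the rounded figures above.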
Here's a description of the various components that make up the FPGA NES:
Spartan 6 LX16 FPGA Core:
The heart of the board, the Spartan-6 is the programmable chip that runs the NES. It's a pretty advanced FPGA where the basic building block is a 6 input LUT instead of the more commonly used 4-input LUT used in the Spartan-3 series. Here's an FPGA Logic Cells comparison with more info.
The LUT is used to create an arbitrary function of 6 bits of input into one bit of output, i.e. each LUT holds a truth table of 2^6 = 64 bits. These LUTs get interconnected, and state is persisted between clock cycles through the use of flip flops, which each keep 1 bit of memory.
16Mbyte Micron Cellular RAM
This is PSRAM, a hybrid of a cheap DRAM and a more expensive SRAM. It still requires DRAM refresh cycles, but all of that happens behind the scenes so I don't need to worry about it. The RAM has an access time of 70ns, meaning that after I put the address out on the bus, I need to wait 70ns before the data is available for me to read. In my RAM logic, I wait 2 cycles (about 93ns at 21.4773MHz). I timed it so I perform one single byte fetch per PPU clock.
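A quick check of the wait-state arithmetic: one cycle of the 21.4773 MHz clock is shorter than the 70 ns access time, so two cycles (a bit over 90 ns) is the minimum whole number of cycles that satisfies it. The constant names below are just for this sketch.

```python
import math

MASTER_HZ = 21_477_272      # approximate master clock from the text
ACCESS_TIME_NS = 70         # PSRAM access time

cycle_ns = 1e9 / MASTER_HZ                          # about 46.6 ns per cycle
min_cycles = math.ceil(ACCESS_TIME_NS / cycle_ns)   # whole cycles needed
wait_ns = min_cycles * cycle_ns

print(min_cycles, round(wait_ns, 1))
```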
The NES has separate address buses for the CPU (code) and the PPU (graphics). Since I use a single RAM chip, I'm time multiplexing the accesses. The PPU accesses RAM only every second clock cycle, i.e. 50% of the time, while the CPU accesses it every third PPU cycle (33% of the time).
8-bit VGA
This is a rather mediocre video solution, as it only gives me 3 bits for Red, 3 bits for Green and 2 bits for Blue. With only 2 bits of Blue, this allows for just 4 levels of grayscale, so I can't even represent all NES colors accurately (see the picture). I need to move away from this once I figure out a better solution. I really wish the Digilent board had used a proper video DAC instead, such as the Analog Devices ADV7123.
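To illustrate the limitation, here is a sketch of packing a 24-bit colour into 8 bits by dropping the low bits of each channel. The RRRGGGBB bit layout and the function name are assumptions; the board's actual pin assignment is not described here.

```python
def to_rgb332(r, g, b):
    """Pack 8-bit-per-channel colour into one byte, assumed layout RRRGGGBB."""
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)

# Blue keeps only its top 2 bits, so it can take just 4 distinct values,
# which caps the number of neutral grey steps at 4.
blue_levels = {to_rgb332(0, 0, b) for b in range(256)}
print(len(blue_levels))
```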
PMOD Connector
These are Digilent's own expansion connectors. I created some simple audio circuitry using a 16 bit AD1866 Audio DAC connected to a single supply Operational Amplifier. The Audio DAC has a bit limited voltage range, 1.5V - 3.5V, so the peak to peak voltage after the opamp is only around 1 volt or so. I discovered the hard way by trial and error that I had to connect an electrolytic capacitor in series with the speaker to act as a high pass filter and remove the DC component to get the average voltage down to 0 volts instead of 2.5 volts.
USB Interface
The Digilent USB Interface can also be used as a bidirectional communications channel between the FPGA and a program running on the PC. The only controllers I had were USB SNES controllers, so I hooked the controller up to the PC and have a small program that reads the joystick movements and sends these over the cable to the FPGA.
6502 CPU Core
I wrote my own 6502 Verilog core. It seemed like a fun challenge and was a lot more educational than just stealing someone else's. The CPU contains an ALU, a set of registers, and a datapath structure that connects these together. Each instruction runs in a variable number of cycles, and a big microcode table controls which muxes and control signals are active for every cycle of an instruction. There exists a lot of good documentation of how the 6502 works down to the cycle level.
Interestingly, the 6502 had a bunch of undocumented instructions. These exist because instructions are only partially decoded by a Decode PLA, and the unused opcodes trigger random control signals in the CPU, causing unplanned things to happen. I implemented most of these, except for some very esoteric ones.
APU (Audio Processing Unit)
The NES APU contains 5 tone generators: Two square wave generators, one triangle generator, one noise generator and a DMC (Delta Modulation Channel). These are periodically controlled by the CPU by writing to a few IO registers in the APU. The outputs of these 5 channels are then combined in a way to mimic the audio hardware of the NES, by using a few look up tables implemented on the FPGA's block ram.
Because the sample frequency of the APU is 1.79MHz, and the output frequency of my DAC is much lower, I need to implement a digital low pass filter to remove the very high frequency components and avoid aliasing artifacts. I implemented a 767th order (768 tap) FIR filter using one of the FPGA's DSP Multiply and Accumulate units and some block RAM. The DSP unit can compute expressions of the form P = P + A * (B + C) in a single clock cycle. The FIR filter runs at 21.4773 MHz, and generates an output sample rate of 1/384th of that (55.9 kHz). Because the filter's coefficients are symmetrical, I only need to perform 768/2 = 384 multiplications per output sample, i.e. exactly one multiplication per clock cycle.
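The symmetry trick can be shown in a few lines of Python: when coefficient i equals coefficient N-1-i, the two samples that share a coefficient are added first and multiplied once, which is the P = P + A * (B + C) pattern. This is a sketch with made-up function names and integer test data, not the filter used in the FPGA.

```python
def fir_direct(coeffs, samples):
    """Plain FIR dot product: one multiply per tap."""
    return sum(c * x for c, x in zip(coeffs, samples))

def fir_folded(coeffs, samples):
    """Symmetric FIR with an even tap count: one multiply per coefficient pair."""
    n = len(coeffs)
    assert n % 2 == 0 and len(samples) == n
    acc = 0
    for i in range(n // 2):
        # Pre-add the two mirrored samples, then multiply-accumulate.
        acc += coeffs[i] * (samples[i] + samples[n - 1 - i])
    return acc

# Output sample rate when one output is produced every 384 clock cycles.
output_rate_khz = round(21_477_272 / 384 / 1000, 1)
```

With 768 taps the folded loop runs 384 times, so at one multiply per clock it finishes exactly when the next output sample is due.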
PPU (Picture Processing Unit)
Generates output pixels and VGA control signals so the picture can be output on a standard PC monitor. I use the output resolution 512x480, so I implemented a scanline doubler in the FPGA, so that one PPU pixel becomes a 2x2 block of pixels on the VGA screen. The PPU implements the NES tile engine, where the screen is made up of 32x30 tiles of 8x8 pixels each, and there can be 64 sprites, each either 8x8 or 8x16 pixels. The PPU aims to be completely cycle accurate, or else there will be glitches in certain games that depend on the exact NES clock timing.
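A software equivalent of the scanline doubler is plain nearest-neighbour 2x scaling, sketched below with a hypothetical function name. In hardware the same effect comes from reading each buffered line twice and holding each pixel for two output clocks.

```python
def double_image(image):
    """Turn each pixel into a 2x2 block (nearest-neighbour 2x upscale)."""
    out = []
    for row in image:
        doubled = [p for p in row for _ in range(2)]  # repeat each pixel
        out.append(doubled)
        out.append(list(doubled))                     # repeat the scanline
    return out

# 32x30 tiles of 8x8 pixels give the 256x240 PPU image,
# which doubles to the 512x480 output resolution.
frame = [[0] * (32 * 8) for _ in range(30 * 8)]
scaled = double_image(frame)
print(len(scaled[0]), len(scaled))
```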
Mappers
So far I've only implemented a subset of all mappers that exist.
MMC1: Nintendo ASIC, used in Megaman 2, Zelda, Metroid.
MMC2: Nintendo ASIC. The only game that uses this seems to be Punch Out. It contains some dynamic bank switching to allow for more tiles than normal to be displayed on the screen simultaneously.
MMC3: Nintendo ASIC, used in Megaman 3,4,5, Super Mario Bros 2,3. This mapper features a pretty interesting scanline counter that counts the number of transitions on the address lines. By making some assumptions on the pattern at which the PPU accesses things, an IRQ can be generated to the CPU when a certain scanline is about to be rendered. This is mostly used for split screen effects.
Tepples Multi Discrete Mapper: A single mapper that can be reconfigured to support several different discrete logic mappers. I reuse this code path to implement support for the UxROM, CNROM, AxROM mappers.
Results
Here is a photo of the result:
Spartan 6 LX16 Utilization: