tomek's blog now I'll lie

Why did I spend 1.5 months creating a Gameboy emulator?

For me, the most favorite type of a computer program is an emulator. Being able to run code from a completely different hardware architecture always seemed like a magic. The old computers are great on their own, so this kind of connection between historic machines and the modern computing environment feels almost like a time travel.


As a developer I often think about the internal design of an emulator. I imagined this big switch construct that chooses the right operation for the current CPU opcode and the array modelling the memory. After watching The Ultimate GameBoy talk found on the Hacker News, I realized that the Gameboy architecture is quite simple and maybe writing a running emulator for this kind of machine wouldn’t be that hard - especially that it’s well documented too.

I realized later that it wasn’t exactly the truth - creating a working program was quite a challenge. At the same time it was one of the most engaging and rewarding side-projects I’ve ever had. It was also very addicting - every time I had a few minutes during the day and basically every evening I felt an irresistable urge to move the emulation a bit forward - fix this strange GPU bug, pass one more compatibility test or implement one more missing feature.

My family members who saw this and friends who heard about this from me often asked if this hasn’t been done before. Of course it was - there’s a dozen of existing GB emulators, created in JS, C, Java, Rust, Python and probably all the other languages. However, it wasn’t the point to conquer the GB emulators market, but rather to completely understand the architecture of a simple computer system and implement it in a performant way. I learnt a lot during the process:

  • how does the CPU work and what are the opcode cycles,
  • how does the GPU display pixels on the screen, what are the HBlank and VBlank,
  • how is the sound being generated,
  • how do the interrupts work and why they’re useful.

For all of these points I had a general idea before, but actually implementing them required a deeper understanding.

CPU and memory

First I implemented all the Gameboy CPU opcodes. It’s not exactly the Z80, but it’s pretty close. As a reference, I’ve used the GameBoy CPU Manual - later on I discovered that it has a few typos and is not specific enough for some of the operations. After implementing the opcodes and memory (modelled by an int[] array) I was eager to check whether it’s possible to run some code on it. I had neither the cartridge nor the GPU emulation yet, so even the simplest game wouldn’t be an option. However, when the original Gameboy starts, it executes a simple 256-bytes program, a kind of firmware displaying the Nintendo logo and self-testing the system. That was exactly the thing I was looking for - after just 3 days I had an application running the Gameboy code!

public void testBoot() {
    Mmu mmu = new Mmu();
    Cpu cpu = new Cpu(mmu);
    while (true) {

CPU timing

At this point I believed that the CPU does what it should do. Now I had to make sure that the work is being done in a timely manner. The length of each CPU and GPU operation is strictly defined and the games written for Gameboy depends on this timing. Eg. the LD A, (a16) CPU operation takes 16 ticks of the internal clock, while displaying a single line takes 456 ticks. The common approach adopted by the emulators is to hard-code these cycle values. However this doesn’t seem “right” to me. Instead, I decided to model the CPU and GPU operations in a way that is close to the original hardware, so the timing will be preserved implicitly. For example, rather the defining the LD A, (a16) opcode (which loads a memory byte into the internal register) as follows:

if (opcode == 0xfa) {
    int byte = addressSpace.get(operand);
    cpuCycles += 16;

I decided to split into some kind of micro-operations, as the real CPU does. In general, each micro-operation that accesses the memory takes 4 cycles:

// reading the opcode byte takes 4 cycles
microOps.add(new Read16BitOperand()); // loads two bytes -> takes 8 cycles
microOps.add(new LoadByteFromMemory(operand)); // loads one byte -> takes 4 cycles
microOps.add(new StoreToRegister("A")); // takes no extra cycles

Later on I created a simple DSL that made defining CPU opcodes easier:

regCmd(opcodes, 0xfa, "LD A,(a16)").load("(a16)").store("A");

This also allowed to split the execution of a single opcode into a few phases, so the GPU and other parts of the Gameboy system takes chance to run their own logic between the load("(a16)") and store("A"), etc. The micro-operations themselves are atomic. Also, having this DSL is place made defining all the opcodes much easier, as we can group them together:

for (Entry<Integer, String> t : indexedList(0x0b, 0x10, "BC", "DE", "HL", "SP")) {
    regCmd(opcodes, t, "DEC {}").load(t.getValue()).alu("DEC").store(t.getValue());

Basic GPU implementation

The next step was implementing the basic graphics support. Gameboy displays two kinds of graphics: background (which is usually more static) and sprites (moving objects, like Mario or falling block in Tetris). The bootstrap code displays a scrolling “Nintendo” logo using the basic GPU features, so I already had a way to test the first implementation.

For the GPU, I also wanted to have it emulated closely to the original. The aforementioned Ultimate GameBoy talk contains a pretty good explanation on how the Gameboy copies video memory into screen, using the fetcher. I implemented the fetcher, pixel FIFO, the background support and bang! - I can see a beautiful “Nintendo” logo scrolling down the screen:

Nintendo Logo

Gameboy subsystems

My next goal was running a simple game - like Tetris or Dr Mario. ROM size for these games is just 32k, so there’s no need to implement memory bank switching. I still missed a few features to have it working, but they are rather simple, assuming that we already have the emulator stub working:

  • joypad,
  • timer,
  • DMA,
  • sprites,
  • interrupts.

Joypad and timer are really simple subsystems. They can be accessed using just a few bytes in the memory and their protocol is unambiguous.

The sprite support is divided into two parts: finding objects that should be displayed in the current line (“OAM search” phase) and actually putting them to the LCD. The GPU implementation was already started and I had the placeholders to put the appropriate logic. The main challenge was working out which pixel should be on displayed in case there’s a few sprites overlaying each other or there’s a non-empty background.

DMA is a mechanism that allows to copy a memory block into the sprite video memory without involving the CPU.

Interrupts support was a challenge - there’s a few registers and they enable and disable the interrupts on a few levels. Eventually, the code wasn’t too complex, but it took my some time to find out how it all should work.

Having all these things implemented, probably in a buggy and incomplete way, I was able to run Tetris. Yay!


Sound and timing

The last big subsystem, waiting to be implemented, was sound. For me, working with the sound is somehow harder than implementing the graphics. Putting pixels on a screen is easy, but generating sound effects from scratch is a different pair of shoes. First I have to decrypt terms like “frequency sweep” or “volume of envelope”. Fortunately, it’s well documented. After reading the GB CPU Manual and the Gameboy sound hardware at least a few times, I was finally able to make the heads or tails of this.

Also, I rely on the Java Audio System to make the emulator run with an accurate speed. I don’t use the Thread.sleep() statements anywhere in the code. Instead, I fill the (very small) audio buffer constantly. If the buffer is full, it plays the sound and the whole emulation is halted until it’s done. As a result, the audio sampling rate defines how quick the emulator should be.

Making it exact with Blargg’s tests

They say that debugging an app is much harder than writing it. That was true for the Coffee GB. After finishing the sound implementation I had an application working with a few games, crashing with a few others, but I was totally unable to say which parts of the program were implemented correctly and where I should look for potential bugs: is it the CPU timing or logic, interrupts, cartridge memory bank switching or GPU implementation, etc. Fortunately I ran into the Blragg’s GB tests - a set of ROMs that test different aspects of the emulated machine: CPU instructions and their timing, graphics, sound and a few hardware bugs (the OAM bug and the halt bug).

CPU test

I think that forcing the emulator to pass all of these tests took my longer than writing the untested version of the code in the first place. At some point I integrated the tests with the build process and Travis, so the ROM-based tests are now being run in the cloud every time I push a change to GitHub.

CPU test in CLI

After fixing the bugs found in the CPU emulation I was finally able to run the Super Mario Land:

Super Mario Land

Ultimately, I had all of the tests green. Most of the game I tried were working fine and if some didn’t, it was easy enough to find a bug - as I was certain that eg. the CPU instructions behave correctly.

Gameboy Color

The last thing on my TODO list was adding the support for Gameboy Color. I somehow expected that the GBC architecture is almost identical to the classic GB’s one, with a few differences (like supporting colors). Wikipedia points out other improvements: more RAM, 2x faster CPU and IR transmitter. In order to get the exact list of things that should be added, I grepped the Pan Docs looking for “CGB”. Query resulted in following list of things that should be available in the GBC mode:

  • 2x faster CPU mode (has to be manually enabled),
  • 4 switchable RAM banks, 8192 bytes each,
  • 2 switchable video RAM banks,
  • new mechanism of memory copying - HDMA.

Of course, there are some improvements in the GPU as well:

  • new palette registers, responsible for the color support,
  • background tiles have their own attributes, defining flipping, palette and priority,
  • the background and sprite priorities are now more robust.

It’s interesting that the changes in the video handling are rather cosmetic - colors were implemented with a minimal effort, which shows how flexible the original Gameboy architecture is.

There’s a few Blargg’s tests meant to be run on the GBC. Also, the Wario Land 3 was a great way to check the new mode:

Wario Land 3


My emulator works quite good with the most of the games I tried, although its performance might have been a bit better. It’s not a competition for the most popular applications in the category, like Gambatte - for this, it lacks of some useful features not related directly to the GB system itself, like a nice GUI, snapshot saves, etc.

So why did I spend all this time trying to write it? I treated it as a programming riddle (or a series of riddles), quite similar to those you can find on the Project Euler. It’s a complex, self-contained problem that can be nicely splitted into stages (as I did in the blogpost) and every stage, once completed correctly, gives a rewarding result. Maybe because of these results I got quite addicted to the project itself. Usually I divide my spare time between watching TV series, reading books, playing video games, etc. This little app practically wiped out all these activities and even if I have 5 minutes of a spare time, I was trying to fix this weird GPU or sound bug. I guess it’s not too healthy (especially if you have a family), but it also doesn’t happen that often. Right now I don’t have any obvious ways to move the app forward, so the addiction dissipated ;)

Tetris DX

See Coffee GB on Github.

Update (May 2018)

Hacker News discussion Reddit discussion