Adventures Pursuing the Schrödinbug; or, How to Debug Your Code
Earlier there was a thread about how to debug code, for which I really can't think of a good, succinct answer. There's quite a bit of an art to debugging code. Those of you who are intermediate and expert programmers probably have your own habits, strategies, and tools--but if you're a beginner then there's a grim reality you'll experience soon enough: You'll likely spend much more time fixing your code than writing it "fresh".
The difficulty in writing a program scales up with its size--I'd say exponentially. So you really don't want to dive into a large project without having a few tricks up your sleeve.
For that reason I thought I'd share what I do to manage the herculean task of debugging something the size of Idosra's Vallas Engine. The last time I counted, I had around 35,000 lines of code in 1 MB of .hx Haxe files. I've added and deleted a lot since then, but the grand total hasn't changed much. That puts it at around the same storage size as the Stencyl engine, although I tend to be wordier when it comes to comments. So there's a lot of room for something to break.
If (or rather, When) Things Go Wrong...
The first principle I adhere to is controlled failure--anticipate where things might go wrong and fail on purpose to prevent a larger catastrophe. A good example of this kind of engineering is the case of the Boeing 737 fuselage. The B737 fuselage is lined with metal strips forming a 10in by 10in grid. Should a break in the fuselage occur in flight, the tear will be stopped by the metal strips, opening a small hole which lets the cabin depressurize. This controlled failure prevents the worse catastrophe of an explosive decompression.
In my case, I have the debug console, which has been featured at the start of the last few videos I've uploaded. I've posted a few screenshots of the "Guru Meditation" errors, such as the one below.
So what's going on is really pretty simple. Whenever I anticipate that a certain erroneous condition might occur, I add a bit of code that sets an error code and an explanation string with more information. My rendering routine checks the state of the error codes. If there are no error codes stored, it displays the screen. If there is an error code stored, it kicks out to the debug console and flashes the red "Guru Meditation" (which, BTW, is a play on the Amiga's equivalent of a "BSOD" error).
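A minimal sketch of that pattern, written in Python rather than the engine's Haxe. The names `ErrorState`, `flag`, and `render_frame`, the code "SR01", and the choice to keep only the first error are all assumptions made for illustration; the post doesn't show the real implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorState:
    """Holds at most one pending error: a short code plus an explanation."""
    code: Optional[str] = None
    detail: str = ""

    def flag(self, code: str, detail: str) -> None:
        # Assumption: keep the first error, since later ones are often
        # knock-on effects of the first.
        if self.code is None:
            self.code = code
            self.detail = detail


def render_frame(errors: ErrorState) -> str:
    """Draw the scene normally, or fall back to the debug console."""
    if errors.code is None:
        return "scene"
    return f"Guru Meditation {errors.code}: {errors.detail}"


errors = ErrorState()
print(render_frame(errors))  # no error stored, so the scene is drawn

# Somewhere in game code, an anticipated bad condition is detected
# ("SR01" and its message are hypothetical):
errors.flag("SR01", "sector has no walls to draw")
print(render_frame(errors))
```

The point of routing everything through the renderer is that the program keeps running long enough to show the code and the explanation, instead of dying at the point of failure.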
Most of the time these error codes prevent the much more ambiguous "Error 1009" / "Segmentation Fault" errors that are a pain to diagnose. The first two letters tell me which file to find the broken code in, and the last two digits give me something to search for so I can pull that code up in a text editor. SR, for example, is Sector Renderer, so an SR error is an error in the code that draws a room.
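To make the naming scheme concrete, here is a small Python sketch that splits a code into its two parts. The two prefixes are the ones named in this post; the `MODULES` table and the `describe` helper are hypothetical and not part of the engine.

```python
# Prefixes taken from the post; any others would be added the same way.
MODULES = {
    "SR": "Sector Renderer",
    "LD": "LayerDrawStack",
}


def describe(code: str) -> str:
    """Split a code like 'LD03' into its module prefix and search number."""
    prefix, number = code[:2], code[2:]
    module = MODULES.get(prefix, "unknown module")
    return f"{module}, error {number}"


print(describe("LD03"))
```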
The error codes are a means of controlled failure that often keep the running program alive long enough to help me figure out the root cause of a problem. As many of us have experienced, often finding the bug takes much longer than fixing it. Anything that helps you pinpoint errors faster is going to speed up your debugging efforts.
So now... The Schrödinbug
What is a "Schrödinbug"? It's a cute little name for a type of bug--one of many (see the Wikipedia entry). It's a bug that appears to lie dormant until the programmer notices the code shouldn't work. In actuality, it's code that was never correct, but the conditions that trigger the bug had never been met.
My Schrödinbug came to light as I was adding code allowing the player to walk out of one map and into another. I modified the 2x2 "testquads" map, which I used for AI testing, to have a map exit. In this case it exits to another copy of itself, but the actual map used doesn't matter. Exiting forces a scene change, which is what I need to test.
When I added my code to the rest of the player movement code, I noticed something was amiss. But--it's been working all summer. Surely there can't be an issue with it now?
Entering the "testquads" map... all looks okay.
Walk to the room to the south (lower left)...
Where's Marika??? Obviously something's wrong here. Now I do have sound implemented and I knew she was still "alive" because I could hear her footsteps. It's worth explaining real quick how actors in Idosra work. There's a small class hierarchy:
Character Data -> VActor -> Actor
A "character" is an entity that can exist on any map. Characters remain in memory no matter what scene/map is shown. Every character on the current map has a "VActor" instance, and every VActor has an "Actor" instance. An "Actor" is what you'd be used to from Stencyl--it's just a Stencyl actor. A "VActor" contains some additional data relevant to 3D physics and 2.5D rendering.
It's possible for a character to not have a VActor, which would be the case if they're on a different map than the player is. When the map changes, I loop through all the characters and create VActors and Actor instances for the characters that are now on stage. Of course, since the player is always on stage, the player should always have a VActor and Actor instance.
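The ownership chain and the map-change loop might look roughly like this in Python. This is a sketch under assumptions: the field names, the `change_map` function, the "Shopkeeper" character, and the "town" map are invented, and dropping the VActor of an off-stage character is a guess at how "not having a VActor" comes about.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Actor:
    """Stand-in for the engine-level (Stencyl) actor."""
    name: str


@dataclass
class VActor:
    """Extra 3D physics / 2.5D rendering data; always owns an Actor."""
    actor: Actor


@dataclass
class Character:
    """Persistent entity; stays in memory regardless of the current map."""
    name: str
    map_name: str
    vactor: Optional[VActor] = None  # None while the character is off stage


def change_map(characters: List[Character], new_map: str) -> None:
    """Give on-stage characters a VActor/Actor pair and drop the rest."""
    for ch in characters:
        if ch.map_name == new_map:
            ch.vactor = VActor(actor=Actor(name=ch.name))
        else:
            ch.vactor = None  # assumption: off-stage characters lose theirs


cast = [Character("Marika", "testquads"), Character("Shopkeeper", "town")]
change_map(cast, "testquads")
print([(c.name, c.vactor is not None) for c in cast])
```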
The fact that I could "hear" Marika suggested her VActor instance was still alive, since that's where the movement code is housed. But what about her Actor? Is it gone? Did it somehow get pushed to the very bottom layer, under the floors? Did it get misaligned with the camera, so it's always off screen? There are so many possibilities here.
I managed to walk her back to the door, to see if she'd reappear in the room she started in:
Nope! At this point I already suspected what might be the problem, but "LD" is the last error I ever want to see. "LD" is "LayerDrawStack", which is the code that solves the "Sorting Stumper" problem I wrote about in April.
"Elementary," said he.* So here's where we begin the detective work. I gave a technical overview of how rendering works in the "Sorting Stumper" link above, but the short version is this: Walls are drawn on layers and the layers are arranged in the right order from far to near. I compute the minimum number of layers needed to draw the walls correctly, and then compute which layer each wall should be drawn on. For actors, I basically draw them on the layer that the floor they're standing on is. There's a lot more to that if you want to read the details, but that's the gist of it.
When any actor moves through the room, I have to check that they're drawn on the right layer. When this layer changes, I pull them from the old layer and push them onto the new layer.
Error LD03 occurs when the code tries to pull the actor from a layer, but they're not on one. That would show up as either a crash or "Error 1009" without the error code, so LD03 is a lot more useful. I now know why Marika disappeared. She was pulled off of a layer, but never pushed onto a new one. And that error must have occurred when she walked through the door.
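The pull/push bookkeeping and the LD03 check can be sketched as follows. This is Python rather than Haxe, and the class layout, method names, and string actor IDs are assumptions; only the name LayerDrawStack and the meaning of LD03 come from the post.

```python
from typing import List, Optional


class LayerDrawStack:
    """Tracks which actors are drawn on which layer of the current room."""

    def __init__(self, layer_count: int) -> None:
        self.layers: List[List[str]] = [[] for _ in range(layer_count)]
        self.error: Optional[str] = None

    def push(self, actor: str, layer: int) -> None:
        """Start drawing the actor on the given layer."""
        self.layers[layer].append(actor)

    def pull(self, actor: str) -> bool:
        """Remove the actor from its layer; flag LD03 if it is on none."""
        for layer in self.layers:
            if actor in layer:
                layer.remove(actor)
                return True
        # Controlled failure: record a searchable code instead of crashing.
        self.error = "LD03"
        return False


stack = LayerDrawStack(layer_count=3)
stack.push("Marika", 1)
stack.pull("Marika")  # fine: she was on layer 1
stack.pull("Marika")  # she is on no layer now, so LD03 is flagged
print(stack.error)
```

The second `pull` mirrors what happened in the game: an actor that was removed and never re-added can't be removed again.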
So the "controlled failure" really did do its job. Imagine getting an "Error 1009" and trying to locate that. I think a lot of us have experienced that before.
My next strategy is to go into the suspect section of code (the movement code that handles sector changes) and stick a bunch of trace/print statements. Basically, print out the value of every variable and look for something that is wrong.
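As a rough illustration of that kind of tracing, here is a tiny Python helper that dumps labeled values on one line. The helper and the values passed to it are hypothetical; in Haxe the built-in `trace` function would be the natural tool.

```python
def trace(label: str, **values: object) -> str:
    """Format a one-line dump of named values, print it, and return it."""
    line = label + " " + " ".join(f"{k}={v!r}" for k, v in values.items())
    print(line)
    return line


# Hypothetical values at the moment the actor crosses a doorway.
trace("sector change:", old_room="A", new_room="B", old_layer=1, new_layer=1)
```

Printing the name next to each value matters more than it seems: with a dozen variables in the output, unlabeled numbers are nearly impossible to read back.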
So what did it turn out to be? It seemed weird at first that adding a new door somehow "broke" another door, but once I got to the bottom of it, it made sense. It has to do with the "Growing Squares" algorithm that I use to partition the segments of the floors and walls into rectangles.
By adding a new door, it just so happened that the layer Marika enters the new room on changed to Layer 1. In the starting room, the layer Marika leaves from is also Layer 1.
Now these are "different" Layer 1s, but the code wasn't checking for that. To decide whether the layer an actor is on needs to be updated, it checks if the layer index has changed. When Marika went from Layer 1 in the old room to Layer 1 in the new room, the code didn't see them as two different layers. Hence, when Marika was plucked from the old room, she was never inserted into the new room. When I walked her back, she couldn't be pulled from a layer again, and so the error flagged.
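The faulty comparison, and one plausible way to fix it, can be shown in a few lines of Python. The post doesn't show the actual fix, so treat this as a sketch; the function names, the `LayerRef` tuple, and the room names are made up.

```python
from typing import Tuple

LayerRef = Tuple[str, int]  # (room name, layer index); room names are made up


def needs_move_buggy(old: LayerRef, new: LayerRef) -> bool:
    """Compares layer indices only, so Layer 1 -> Layer 1 looks unchanged."""
    return old[1] != new[1]


def needs_move_fixed(old: LayerRef, new: LayerRef) -> bool:
    """Compares room and index together, so a room change always counts."""
    return old != new


old_layer = ("start_room", 1)
new_layer = ("south_room", 1)
print(needs_move_buggy(old_layer, new_layer))  # False: the update is skipped
print(needs_move_fixed(old_layer, new_layer))  # True
```

The buggy version works whenever the two rooms happen to use different indices at the doorway, which is why it survived a whole summer of testing.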
I think it's really a story emphasizing how important beta testers are. What are the odds that the layers would align themselves just so precisely as to trigger that bug? I think it's also important to have these kinds of error codes in your released games. Had this game been released and one of my players told me they got an "Error LD03", I'd have a hope of fixing it. If they just told me "the game crashed", then who knows.
In closing, it took me about four hours to diagnose this bug. The time it took to fix it? 15 seconds.
* Fun Fact: Sherlock Holmes never said "Elementary, my dear Watson". Also, my last name is "Watson", and I never hear the end of that quote.
