v1.0.1
If you're reading this I'll assume that you know that bugs in a program are bad and that we should want to get rid of them. But I'm going to take a step back for a moment to consider the problem. If you want to find a solution, you have to know what problem you're trying to solve, after all.
So just what is a bug anyway?
Well, for the purposes of this article I'll define a bug as any program behaviour (output, state changes, side-effects) that is different from what is intended or designed. Put another way, a bug is the difference between a program that works the way we want it to work and a program that doesn't. The most familiar example is a difference between the actual and the expected output; often the first way we spot a bug is when the program's actual output doesn't match what we expected.
Why are bugs present?
Who put them there, anyway? I keep telling everyone to stop doing that!
Bugs are caused by faulty assumptions. A faulty assumption could be anything from "this number is always positive" to "no one else is writing to this database at the same time as me" to "only one instance of this program will run at a time". Often when we as programmers make these assumptions we aren't even aware of the simplifications we're building into the model, let alone when they might be broken. After all who would ever ask for the square root of a negative number? It's unimaginable! Then at some point our program is run under circumstances where our assumptions fail. Maybe we wanted to open a file but another process is already using that file (or maybe it got moved or deleted), so the program doesn't know what to do and crashes. Our assumption that the file would always be available was wrong.
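Here is a minimal sketch of a hidden assumption breaking, using the square-root example. The function name `other_leg` and the triangle scenario are made up for illustration; the point is only that nothing in the code states the assumption out loud.

```python
import math

def other_leg(hypotenuse: float, leg: float) -> float:
    """Return the remaining leg of a right triangle.

    Hidden assumption: hypotenuse >= leg, so the value under the
    square root is never negative.
    """
    return math.sqrt(hypotenuse ** 2 - leg ** 2)

# The case the author had in mind works fine.
print(other_leg(5.0, 3.0))

try:
    # A caller swaps the arguments and the assumption quietly fails.
    other_leg(3.0, 5.0)
except ValueError as exc:
    print("assumption failed:", exc)
```

The function is "correct" for every input its author imagined. The bug only exists once someone calls it with an input they didn't imagine.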
Okay sure, a bug is unintended behaviour, and they're there because we made some bad guesses.
How do we just fix them already? This is a de-bugging guide and not a bugging guide, isn't it?
[UPDATED] First you need to reproduce the bug. Once you can reliably reproduce it, you can start investigating it. It's hard to chase what you can't see, after all. You'll also need a way to check whether the bug is actually fixed, so it's essential to have steps that reproduce the problem.
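One way to make the reproduction steps concrete is a tiny script that triggers the bug on demand. This is only a sketch: `average` is a hypothetical buggy function standing in for your real code, and the empty list stands in for whatever input you found that triggers the problem.

```python
def average(values):
    # Hypothetical buggy function: assumes the list is never empty.
    return sum(values) / len(values)

def reproduces_bug() -> bool:
    """Return True if the known-bad input still triggers the failure."""
    try:
        average([])
    except ZeroDivisionError:
        return True
    return False

if __name__ == "__main__":
    print("bug reproduced" if reproduces_bug() else "bug not reproduced")
```

The same script doubles as the check you run after the fix: when it stops reporting the bug, you have evidence the fix did something.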
Go with what you CAN see, what you DO know, and build from there. First you can see the output that you are getting; sure, it's not what you wanted to get, but you get something (or sometimes not, which is interesting too). Depending on what you're doing you might get some kind of Exception, a segmentation fault, some kind of arcane integer error code, nothing whatsoever, or any number of other things. If your error code is specific, you're in luck. Find out what generates that specific error. At some point the instruction pointer / program counter reached that point in execution. Now you have to figure out the context where it got triggered by tracing up the stack. That will point you to some piece of code somewhere that contains your problem. Don't be afraid to trace through any code you have access to (even if it's in a plugin or something) if the error you got traces through there.
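As a rough illustration of tracing up the stack, the sketch below uses Python's standard `traceback` module to list the chain of calls that led to an error. The functions `parse_port` and `load_config` are invented for the example; in real code you would read the same information from the traceback your program prints.

```python
import traceback

def parse_port(text):
    # The error is raised here...
    return int(text)

def load_config(settings):
    # ...but the bad value was handed over from here.
    return parse_port(settings["port"])

def call_chain(exc):
    """Return function names from the outermost caller down to the raise site."""
    return [frame.name for frame in traceback.extract_tb(exc.__traceback__)]

try:
    load_config({"port": "80 80"})
except ValueError as exc:
    chain = call_chain(exc)
    print(" -> ".join(chain))
    print("error:", exc)
```

The last name in the chain is where the error was generated; the names before it are the context that supplied the bad input, which is usually where the real problem lives.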
What if I don't have a specific error and just get something like a SEGFAULT that doesn't tell me anything useful?
Well now you need to investigate. There is always a reason for what the computer does; any visible effect has an associated cause. If you knew every bit of how the program behaves, you would already have solved the issue. Since you don't, you need to figure out exactly how it behaves. First you need to know more; collect data. So if the code segfaults, figure out WHERE it segfaults. In order to do this you need to observe, so OUTPUT MORE. When in doubt, OUTPUT MORE. You need to observe what the program is actually doing. Figure out the last line that was executed before the segfault, and figure out what memory was being read or written there. Think about what to output; there's no point in putting in meaningless output like "asldkfjsdklfjdkl". Output meaningful data, like the value of the variable in the conditional or other important context about the code's execution. You're trying to figure out why the machine got itself into a state you thought impossible. Leave yourself breadcrumbs so you can retrace its steps. Look at what you know, what you can see, how the program behaves, and then what the code says. Try to find the FIRST thing that goes wrong. You're looking for the root cause, not the cascading effects. Trace back from the earliest observed incorrect output to find the section of code that causes the issue.
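To show what "output more" might look like in practice, here is a small sketch using Python's standard `logging` module. The balance scenario and the names `running_balance` and `first_bad_step` are assumptions made up for the example; the idea is to log the step, the input, and the resulting state, then look for the earliest point where the state goes wrong.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("debug-example")

def running_balance(transactions, start=0):
    """Apply transactions in order, logging the state after each one."""
    balance = start
    history = []
    for index, amount in enumerate(transactions):
        balance += amount
        # Meaningful output: which step, what input, what state resulted.
        log.debug("step=%d amount=%r balance=%r", index, amount, balance)
        history.append(balance)
    return history

def first_bad_step(history):
    """Return the index of the first negative balance, or None if there is none."""
    for index, balance in enumerate(history):
        if balance < 0:
            return index
    return None

history = running_balance([50, -20, -40, 100, -200])
print("first bad step:", first_bad_step(history))
```

In this made-up run the balance goes negative twice, but only the first occurrence is the one worth chasing. The later one may just be a cascading effect of the same mistake.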
Okay I've traced back the code to find the first incorrect data and faulty operation. Now what?
Well the good news is that the hard part is done. The next question is of course at that line or call, why didn't it behave the way you expected? What was the difference between how you thought it should work and how it actually did work? And what did you assume about the nature of the input that wasn't true at that moment? Was it the size of the memory you were copying? Was it the location of a file? Once we know that we can fix the issue.
Sure, I've figured out what was wrong in the input. But how do I actually fix it?
Well, we have to account for the case that was outside the capabilities of our originally written handling. There isn't a one-size-fits-all solution here, but here are some things you might do:
- Option a) Generalize the code where you're at so that it handles the input correctly and gracefully. For example if you're writing a sorting program that crashes on duplicates, update it to handle duplicates. Or if you have the option of using a data structure that grows dynamically versus a fixed-allocation data structure, maybe switch to the dynamic one to prevent overflows.
- Option b) Prevent the error from ever occurring by restricting the input, either at the point of error or in some previous calling code. Your program can't handle negative numbers? Don't let users enter them. A word of caution though, checks which are client-side may as well not be there; don't depend on other users' computers running your code without changes.
- Option c) Report the error and continue gracefully. * Sometimes the user tries to divide by zero and that's just not a thing that's possible to do. Roll back to the earliest context before the error and reset things from there. As long as the recovery is quick enough it's okay to even crash and restart the program. If the user wasn't meant to be authorized for that operation return an appropriate FORBIDDEN error, etc. If possible you should tell the user that what they did was a bad idea and that they shouldn't do it again, but try not to be intrusive or break UI flow.
Regardless of what option we use, somehow we have to account for the case that was occurring.
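The three options can be sketched side by side on the square-root example. This is an illustration under assumptions, not a recommendation of any one option: the function names are invented, and which option is right depends entirely on what your program is supposed to do with a negative number.

```python
import cmath
import math

def sqrt_generalized(x: float) -> complex:
    # Option a: widen the code so negative input becomes a valid case.
    return cmath.sqrt(x)

def sqrt_restricted(x: float) -> float:
    # Option b: reject input the code cannot handle before it does damage.
    if x < 0:
        raise ValueError(f"expected a non-negative number, got {x!r}")
    return math.sqrt(x)

def sqrt_or_report(x: float):
    # Option c: report the problem and carry on from a known-good state.
    # Returns (result, error_message); exactly one of the two is None.
    try:
        return sqrt_restricted(x), None
    except ValueError as exc:
        return None, f"could not compute square root: {exc}"

print(sqrt_generalized(-4))
print(sqrt_or_report(9))
print(sqrt_or_report(-1))
```

Note that option c here is built on top of option b: the check is what detects the bad input, and the wrapper decides what the user sees.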
Okay done; I've updated the code to account for that case. I'll just be off now...
Not so fast. Sure, you fixed it for now, you think, for this one particular edge case. But how do you know that the code won't get reverted in the future? Or that some other bug didn't crop up while you were fixing this one? You need some kind of validation. In cases where it's possible, a unit test would suffice. Write a test to make sure if the suspect input is given the program will respond correctly according to the fix you made. And run any existing tests to make sure they all still pass. If there were any other bugs people were aware of, there should be tests for those too. Documentation can also help, to let other programmers know that your fix was intentional and they shouldn't change it.
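A regression test for a fix might look something like the sketch below, written with Python's standard `unittest` module. The `average` function and the decision to return 0.0 for an empty list are assumptions made for the example; your fix and your expected behaviour will differ.

```python
import unittest

def average(values):
    # Hypothetical fixed function: an empty list now yields 0.0
    # instead of crashing with a division by zero.
    if not values:
        return 0.0
    return sum(values) / len(values)

class AverageRegressionTest(unittest.TestCase):
    def test_empty_list_does_not_crash(self):
        # The input that originally triggered the bug.
        self.assertEqual(average([]), 0.0)

    def test_normal_input_still_works(self):
        # Guard against the fix breaking the ordinary case.
        self.assertEqual(average([1, 2, 3]), 2.0)

if __name__ == "__main__":
    unittest.main()
```

The first test pins down the fix so a future revert gets noticed; the second checks that the fix didn't quietly break the behaviour that already worked.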
Okay the test(s) is/are updated. Anything else?
Just learn from it; remember this bug that you've run across so you know how to spot it the next time you see it. And try to spot it while writing code, so you don't have to go through the whole debugging process again next time. This is especially true if your code caused the original bug. Prevention is better than a cure.
* For non-recoverable fatal errors there are multiple schools of thought as to how they should be handled. One idea is to prevent crashes at all costs, and to handle every possible error condition from day one. This is almost never possible in practice. The other school of thought, which I call "Crash and Recover" but which is more commonly called "Let it Crash", says the opposite. It says to make crashes so cheap, so easy to recover from, that your program crashing once in a while isn't a big deal. Then when something inevitably does go wrong, you can let it crash without worrying about it. It restarts quickly and gets back to the screen the user was at, users hardly notice, and it's not a big deal (think a quick page refresh in a browser). I'll probably write a whole article about this in the future; I think crash-and-recover makes things more reliable.
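As a very rough sketch of the crash-and-recover idea, the loop below restarts a task from scratch whenever it fails, up to a limit. This is a simplification I'm assuming for illustration: real systems usually restart a whole process rather than a function call, and `run_with_restart` and `flaky_task` are invented names.

```python
def run_with_restart(task, max_restarts=3):
    """Run task(); if it raises, start it again from scratch, up to a limit.

    Returns (result, number_of_crashes).
    """
    crashes = 0
    while True:
        try:
            return task(), crashes
        except Exception as exc:  # broad on purpose: any crash triggers a restart
            crashes += 1
            if crashes > max_restarts:
                raise  # give up and let the failure surface
            print(f"crashed ({exc!r}); restart {crashes}/{max_restarts}")

attempts = {"count": 0}

def flaky_task():
    # Hypothetical task that fails twice, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_restart(flaky_task))
```

The restart limit matters: without it, a bug that fails every time would turn into an endless crash loop instead of a visible error.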