Marc Hedlund: Debugging Hacks, What They Never Taught You About Solving Hard Bugs


Marc Hedlund talks about debugging

There's no doubt that debugging is a critical skill for anyone who codes. Marc Hedlund's tutorial is about how to tackle the really difficult bugs. I enjoyed Marc's tutorial from last year, and picked this one on that basis.

Most bugs aren't hard. 95% of the time, you can find a fix easily and move on. Marc's tutorial is about what to do when the simple methods don't work anymore. He gives an example of a login that would fail once every 10,000 times or so. It turned out the problem was a filter that would throw out URLs with swear words in them. Finding bugs like that can be hard.
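
That failure mode is worth sketching, because it shows why a 1-in-10,000 bug resists casual inspection. This is a hypothetical reconstruction -- the word list, URLs, and function names are mine, not Marc's code:

```python
import re
import secrets

# Hypothetical word list -- the real filter's contents are unknown.
BLOCKED_WORDS = re.compile(r"ass|damn", re.IGNORECASE)

def is_url_allowed(url: str) -> bool:
    """The buggy filter: rejects ANY URL containing a blocked word,
    including random session tokens that spell one by accident."""
    return not BLOCKED_WORDS.search(url)

def make_login_url() -> str:
    # base64url tokens use the full alphabet, so once in a long while
    # a token contains a blocked substring and login mysteriously fails.
    return f"https://example.com/login?token={secrets.token_urlsafe(24)}"
```

A random token occasionally spells a blocked word, so the filter rejects a perfectly valid login URL -- and the failure never reproduces on demand.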

Marc recommends Why Programs Fail: A Guide to Systematic Debugging. This is a great guide to systematic debugging. Some people are great debuggers. Others can use help.

He uses this example: Segmentation fault using libtidy (symptoms, diagnosis, and bush medicine cure). Here's what he did right:

  • Eliminated possible causes and narrowed in
  • Wrote a test case that exercised the bug and discovered Rails was a factor
  • Used source code and a debugger to gather data
  • Noticed a coincidence
  • Reproduced the failure in his test case

Here are some common mistakes:

"That doesn't look right, but it's probably fine." If you think there's a bug, there's a bug. Pay attention to small hints. If you can't find anything, file a bug report.

"It seems to have gone away." If you didn't fix the bug, it's still there. If you don't understand what the problem is, it will bite you later.

"I bet I know what this is." Wait to form theories until you have data. Let the data lead you. He quotes Sherlock Holmes: "It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."

"That's impossible." Impossible conditions are often the source of bugs. Set up logging, exceptions, and assertions. Make sure you get the report. Make sure you see the failure when it occurs. Questioning the obvious is a good tool. When your Web site produces an exception, send it to the whole engineering team.
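
One cheap way to act on this advice is to assert the "impossible" condition and log the inputs when it fires, so the report reaches an engineer instead of vanishing. A minimal sketch -- the function and subsystem names are hypothetical:

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("billing")  # hypothetical subsystem name

def apply_discount(price: float, discount: float) -> float:
    total = price - discount
    if total < 0:
        # "Impossible" by our business rules -- don't clamp it silently.
        # Log the inputs so the failure is visible when it occurs.
        log.error("impossible state: price=%r discount=%r total=%r",
                  price, discount, total)
        raise AssertionError("negative total after discount")
    return total
```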

"Beats me...probably a race condition." Not all hard problems are race conditions. Usually this means "I don't know"--it's forming a theory without data.

"I'm just going to fix this other bug quickly." Don't make any changes until you understand the bug. File and log any bugs you find along the way--but don't fix them. Otherwise you end up suppressing the first error and missing it.

"That's probably the [server/client] code." Don't guess--prove it. Be humble; don't assume you're the better engineer. If you keep getting reports that turn out to be wrong, the proof will help.

"I think I found a bug in the OS." In all likelihood, the problem won't be in the libraries or in the operating system. That can happen, but you'd better have pretty good evidence.

"That's not usually the problem." Beware of representativeness errors. Sometimes 40-year-olds have heart attacks. If the data leads that way, then follow it.

"Oh, I saw this just last week." This is known as an availability error. Third in a week could be an epidemic--or not.

"This guy's too smart to make that mistake." Beware of sympathy errors. Even engineers put CDs in upside down. Check the data no matter the source. The opposite is also true: don't assume someone's stupid, either.

"I found a bug and fixed it--done." Finding a bug is different from finding the bug.

"I haven't made any progress--it's still broken." Think of the bug report as a collection of information. Adding data, eliminating theories, and recording changes lead to understanding. Clearing bugs is the end goal, but progress can be represented by other things.

"I've got to get this out now--no time for..." Rushed fixes tend to introduce more bugs. Stick to a good process even if the situation is urgent. Distinguish between suppressing a bug and actually closing it.

Here's Marc's general approach to fixing bugs.

  • Revert any changes you made looking for a quick fix - Bring the system to its initial state. People usually try something quick. Getting back to the original condition as quickly as possible is important.
  • Collect data from each of the components involved - Maintain a page with the most concise problem descriptions. State everything you know for a fact. List the questions for which you need answers. Don't delete data; instead move it to a "probably unrelated" section.
  • Reproduce the bug and automate it - You must have access to the reporter's environment. Use virtualization and the browser version archives where needed.
  • Simplify the bug conditions as much as possible - Can you reproduce the bug in other circumstances? Can you remove a condition and still see it? Are there any contradictions in the conditions? ("We only see this on OS X with IE.") Can you separate the problem? Could it be an error in the data?
  • Look for connections and coincidences in the data - Build a set of "that looks weird" observations. Describe all the actors and their roles. Parallel timelines can help. Look at data from client and server viewpoints.
  • Brainstorm theories and test them - State each theory separately. Does the theory cover all of the data in the report? Does it explain why the conditions are necessary? Does it cover all the related reports?
  • When you find a fix, verify it against the report - Go back and re-read the whole bug report. Run all of your reproduction test cases.
  • Check that you haven't created new bugs - It's very common for one fix to create new bugs. Automated test suites help enormously at this point. If X was failing under condition Y but not Z and it now passes under Y, does it still pass under Z? Often the answer is "no."
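
The "reproduce and automate" and "verify against the report" steps above can be captured as a permanent regression test. Here's a sketch for a bug like the login-filter one Marc mentions -- the function name, the fix, and the captured token are all hypothetical:

```python
import unittest

def is_login_url_allowed(url: str) -> bool:
    """Post-fix behavior (hypothetical): the profanity filter checks
    only the path, never the opaque query-string token."""
    path = url.split("?", 1)[0]
    return "damn" not in path.lower()

class LoginFilterRegression(unittest.TestCase):
    # Token captured while reproducing the original report -- replay it
    # forever so the bug cannot quietly come back.
    FAILING_TOKEN = "xJ3assQ9w"  # hypothetical captured value

    def test_reported_token_is_accepted(self):
        url = f"https://example.com/login?token={self.FAILING_TOKEN}"
        self.assertTrue(is_login_url_allowed(url))

if __name__ == "__main__":
    unittest.main()
```

Replaying the exact failing input from the report is what turns "I think I fixed it" into evidence.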

These steps almost always work. You might have to go through them several times. You might need several people to make it work. You might decide it's too costly. Even so, if you go all the way through this process, you will get a fix.

I missed 45 minutes after the break because of a conference call I had to join. So, there's a gap here in what Marc said and what I heard.

The best predictor of new bugs is change rate. Code that is changing a lot will have a lot of bugs. Direct QA efforts by counting changes per file. Spend time testing the stuff that changed.
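
Counting changes per file is easy to script. A sketch that tallies the filename lines from `git log --name-only --format=` output -- the sample paths are made up:

```python
from collections import Counter

def churn_by_file(git_log_output: str) -> Counter:
    """Tally how often each file appears in `git log --name-only` output."""
    files = (line.strip() for line in git_log_output.splitlines())
    return Counter(f for f in files if f)

# Sample output of: git log --since="2 weeks ago" --name-only --format=
sample = """\
app/login.py
app/filters.py
app/filters.py
lib/util.py
app/filters.py
"""
print(churn_by_file(sample).most_common(1))  # [('app/filters.py', 3)]
```

The files at the top of the list are where the testing effort should go.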

The best estimator of code quality is the rate you find bugs. When the find rate goes down, you're ready to ship. You should ignore every other QA measure.

You can do four things with each bug:

  1. Fix it
  2. Suppress it
  3. Record it and wait for more info
  4. Ignore it

You probably can't always afford (1). Of the rest, (3) is the best option.

There's a culture surrounding bugs. Don't scold people for bugs. Everyone creates bugs. If bugs cause punishment, reports will be killed and there will be severe tension with QA. If there's a chronic problem with bugs from one person, deal with it in person.

Reopen rates measure how development deals with bugs. Lots of reopens is a red flag for process--especially within one release. Reopens indicate that bugs are being hidden rather than closed.

Marc has some book recommendations for people who want to understand debugging better: