Persistent-state Checkpoint Comparison for Troubleshooting Configuration Failures

  • Yi-Min Wang
  • Chad Verbowski
  • Daniel R. Simon

MSR-TR-2003-28

Note: To appear in Proc. IEEE International Conference on Dependable Systems and Networks (DSN), June 2003.

Application failures characterized by the phrases “it worked yesterday, but it doesn’t work today” and “it worked on that machine, but it doesn’t work on this machine” are a major source of computer user frustration and a major component of the total cost of ownership. The typical symptom-based troubleshooting approach relies too heavily on creative thinking and may lead users or support technicians in directions far from the actual root cause. In this paper, we propose a state-based troubleshooting approach for configuration failures that aims to make the diagnostic process as mechanical as possible. In the narrow-down phase, we use checkpoint comparison and application tracing to determine which pieces of persistent state have changed and are affecting current application execution; ongoing self-monitoring of persistent-state changes by the machine is used to help eliminate false positives. In the solution-query phase, state-to-task mapping and searches of online databases are used to translate low-level state information into high-level user interfaces and articles. We describe the design and implementation of a troubleshooter that uses this state-based approach and present preliminary results to demonstrate its effectiveness in diagnosing several actual configuration failures.
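The narrow-down phase can be illustrated with a minimal sketch. It is not the troubleshooter described in the paper: it assumes each checkpoint is a flat mapping from a state key to its value, that tracing yields the set of keys the application read, and that self-monitoring yields a set of keys known to change routinely. All function names, keys, and values below are hypothetical.

```python
_MISSING = object()  # marks a key that is absent from one checkpoint


def diff_checkpoints(good, bad):
    """Return {key: (before, after)} for entries that differ between checkpoints."""
    changes = {}
    for key in good.keys() | bad.keys():
        before = good.get(key, _MISSING)
        after = bad.get(key, _MISSING)
        if before != after:
            changes[key] = (before, after)
    return changes


def narrow_down(good, bad, traced_keys, noisy_keys=frozenset()):
    """Keep changed entries that the application read and that are not known noise."""
    changes = diff_checkpoints(good, bad)
    return {
        key: pair
        for key, pair in changes.items()
        if key in traced_keys and key not in noisy_keys
    }


# Hypothetical data: a "worked yesterday" checkpoint and a "fails today" checkpoint.
good = {"app/proxy": "direct", "app/window_pos": "10,10", "other/theme": "blue"}
bad = {"app/proxy": "http://badhost:8080", "app/window_pos": "40,25", "other/theme": "green"}

traced = {"app/proxy", "app/window_pos"}  # keys the failing application read (assumed trace output)
noisy = {"app/window_pos"}                # keys that change routinely (assumed monitoring output)

candidates = narrow_down(good, bad, traced, noisy)
print(candidates)
```

In this sketch, three entries changed between the checkpoints, two of them were read by the application, and only one survives the noise filter, which leaves a single candidate for the solution-query phase.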