"Knowledge and timber shouldn't be much used till they are seasoned." Oliver Wendell Holmes, *The Autocrat of the Breakfast-Table*

My goal was to move beyond point design and to introduce some careful scientific measurement of relevant behavioral principles. Even given that one wants to "evaluate" a system, there are many possible strategies for evaluation that one might choose. The purpose of this chapter is to outline some of the possibilities, to provide a rationale for the research methodology which was ultimately chosen, and to discuss the process for applying that methodology.

*Domain experts:* These are experts, such as my neurosurgeon collaborators, who have thorough expertise and experience in a particular field or activity. Performing evaluations with domain experts is appropriate when the goal is to develop tools for the domain application and to demonstrate that those tools can improve current practice.

*Interface experts:* These are expert users who are proficient with a particular computer interface or set of tools. For example, Card [38] reports experiments with word processing experts. Card's goal is to develop "an applied cognitive science of the user" [38] which proposes some models of the human as a cognitive processor. In the context of Card's studies, interface experts are ideal candidates because they exhibit less variable behavior, which is consistent with cognitive skill rather than the searching behavior of novices performing problem solving.

*Manipulation experts:* These are individuals with great dexterity for skilled manual behaviors, such as painting, sculpting, or playing a violin. Studies of these people and the incredible things they can do with their hands might be helpful for answering questions of how one can build tools that help people to develop comparable skills.^{1}

Neurosurgeons clearly are domain experts, and many neurosurgeons might also be considered manipulation experts because of the fine manual skill required during delicate surgery. Using neurosurgeons for evaluation imposes some constraints on what type of evaluation can be done. Neurosurgeons have heavily constrained schedules, and the available user community of neurosurgeons is quite limited.

*Use by domain experts for real work:* The ultimate proof of any tool is for a group of domain experts to use it to achieve goals in the process of their real work. If the domain experts say it is useful, then the tool is declared a success. This approach has been advocated by Fred Brooks [20]. For neurosurgery, the ideal test would be to deploy a tool in the clinical routine and to plan surgical interventions on real patients. This requires development and support of a commercial-quality tool which has been carefully tested for robustness and safety.

I decided that my primary goal for this dissertation was to make some general points about interface design and human behavior, so that some of the lessons I had learned in the neurosurgery application could be applied to other interface designs. The formal experimentation strategy best meets the requirements to achieve this goal: ample non-expert subjects are available for experimental testing of hypotheses about human behavior.

Even though formal experimentation is my primary approach, this work as a whole includes elements of all three strategies outlined above. I have performed extensive informal testing with domain experts to drive the interface design itself. Furthermore, although the interface is a research tool and not a clinical tool, it has been tested in the context of actual surgical procedures with real patients, in conjunction with our laboratory's surgical planning software [65][66][160], and Multimedia Medical Systems [122] is currently working to develop a commercial version of the interface for clinical use.

Buxton [31] presents the example of two drawing toys: an Etch-a-Sketch and a Skedoodle. The Etch-a-Sketch has two separate one-degree-of-freedom knobs to control the motion of the stylus, while the Skedoodle has a joystick which allows one to manipulate both stylus degrees-of-freedom simultaneously. The "research question" is this: Which toy has the better interface for drawing? For drawing one's name in cursive script, the Skedoodle excels. But for drawing rectangles, the Etch-a-Sketch is superior. The point is that neither toy is unilaterally "better for drawing," but rather that each style of interaction has its own strengths and weaknesses.

During the initial stages of experimental design, pilot studies are conducted on a small number of subjects. Pilot studies for experiments are much like throw-away prototypes for software systems. The goal is to rapidly discover major surprises or flaws in the concept for the experiment before investing large amounts of time in a more formal study. If one is not getting the expected results, why not? Is the experiment fundamentally flawed, or are there minor problems with the experimental design? Pilot studies also allow one to work out the details of the experiment, such as the specific instructions to give to subjects or the amount of time to allow for each trial. The pilot studies drive modifications and improvements to the experimental design, in an iterative process which may go through several cycles.

The final formal study requires collecting the data for a meaningful sample size. The data from pilot studies can be used to calculate an effect size in terms of standard deviation units (the difference between means divided by the standard deviation). The effect size, in turn, can be used to estimate a sample size which will yield sufficient statistical power. Statistical power is the probability that an effect of a given magnitude can be found with a sample of fixed size, assuming that the effect really exists [93]. If the sample size is too small, statistical power is low, and a real effect may well go undetected. Thus, the pilot studies also serve to ensure that the final study will not fail to find an effect because of lack of statistical power.
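The sample-size estimate described above can be sketched with the standard normal approximation for a two-group comparison. This is an illustrative sketch, not the exact procedure used for the experiments in this dissertation; the function name and defaults are my own.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-group comparison.

    effect_size is the pilot-study estimate in standard deviation
    units (difference between means divided by the standard deviation).
    Uses the normal approximation: n = 2 * ((z_alpha + z_power) / d)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed criterion
    z_power = z.inv_cdf(power)          # desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(sample_size_per_group(0.8))  # a large pilot effect -> 25 per group
print(sample_size_per_group(0.5))  # a medium effect -> 63 per group
```

Note how quickly the required sample grows as the pilot effect size shrinks; this is exactly why an underpowered study risks missing an effect that really exists.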

Linear analysis of variance makes several assumptions about the data, each of which must be checked [12][93]. For example, the errors (differences between the predicted model and the observed data) should be normally distributed, and the errors should not be correlated with the predicted values. Also, the data must be checked for outliers (such as trials where a computer glitch may have occurred, or the subject might have sneezed) which could unduly bias the analysis. These details of analysis are not addressed in the context of the individual experiments presented here, but the statistical confidences reported are based on a thorough analysis.

For example, imagine that we are designing an experiment that compares condition A with condition B. In a within-subjects design, each subject will perform both condition A and condition B, yielding two groups of subjects: the A-condition-first subjects and the B-condition-first subjects. These groups will be balanced so that half of the subjects try condition B before condition A, while the other half try condition A before condition B. This can help to ensure that any detected difference between condition A and condition B is not entirely due to the order in which subjects performed the conditions: the effects of order are controlled and can be explicitly analyzed.

| Order | First Condition | Second Condition |
|---|---|---|
| A-condition-first subjects | A | B |
| B-condition-first subjects | B | A |

Figure 5.1 Example Latin Square for two experimental conditions.
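The counterbalancing scheme of Figure 5.1 amounts to alternating subjects between the two orders. A minimal sketch (the function name is illustrative):

```python
def counterbalance(subjects, conditions=("A", "B")):
    """Assign each subject an order of conditions, alternating so that
    half the subjects run condition A first and half run condition B
    first, as in the two-condition Latin Square of Figure 5.1."""
    orders = [conditions, conditions[::-1]]
    return {s: orders[i % 2] for i, s in enumerate(subjects)}

assignments = counterbalance(["s1", "s2", "s3", "s4"])
# s1 and s3 run A then B; s2 and s4 run B then A
```

With more than two conditions the same idea generalizes to a larger Latin Square, so that each condition appears equally often in each ordinal position.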

Within-subjects designs typically use a *repeated measures* technique for analysis of variance. Each subject performs multiple conditions, and therefore contributes multiple data observations. But these are not truly separate observations because the observations from a single subject will tend to be correlated: for example, if a subject performs faster than average during one condition, it is more likely that the subject will perform faster for other conditions as well. Repeated measures analysis of variance takes this repetition of observations into account when computing statistical confidence levels.

The statistical results given in my formal experiments make frequent mention of the *F statistic* and *p*, the *statistical confidence*. Both of these statistics are related to testing a model to see how well it predicts a given set of observations. The observed data values *Y* are the sum of the value *Yp* predicted by the model plus unexplained error *E*:

*Y* = *Yp* + *E*

Adding a variable to a model either helps to explain some additional error that was unexplained before, or it has no predictive power and contributes nothing to the model. For statistical analysis, the relevant question is whether adding a parameter to a model accounts for more error than one would expect by random chance alone. The *F* statistic is a ratio that quantifies this: the sum of squared error accounted for by adding a new variable to the model, divided by the sum of squared error one would expect to account for by adding a random variable. The *F* statistic can be used to look up *p*, the statistical confidence for the model. The statistical confidence *p* is the probability that random selection of *N* observations from a population resulted in an *F* statistic of the given magnitude, given the number of parameters (or degrees-of-freedom) of the model. If *p* is low enough (usually *p* < .05 is considered sufficient), then the statistical analysis suggests that the *F* statistic did not result from random selection alone, and one concludes that adding the new variable to the model significantly improves prediction of the observed data, with a confidence level of *p*.
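The *F* ratio can be made concrete with a small one-way analysis-of-variance sketch: between-group variance (error explained by the grouping variable) over within-group variance (error left unexplained). This is an illustrative stdlib-only computation of *F* itself; in practice *p* is then looked up in an *F* distribution with the appropriate degrees-of-freedom.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of observations.

    F = (between-group sum of squares / (k - 1))
      / (within-group sum of squares / (n - k))
    where k is the number of groups and n the total observations.
    """
    all_obs = [y for g in groups for y in g]
    grand = mean(all_obs)
    k, n = len(groups), len(all_obs)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

print(f_statistic([[1, 2, 3], [2, 3, 4]]))  # → 1.5
```

An *F* near 1 (as here) means the grouping variable explains no more error than a random variable would; a large *F* yields a small *p* and hence a statistically significant effect.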

^{2} I would like to acknowledge personal communication with Rob Jacob which discussed these approaches.
