"Knowledge and timber shouldn't be much used till they are seasoned."
Oliver Wendell Holmes, The Autocrat of the Breakfast-Table
After implementing the props-based interface, I felt encouraged by the enthusiastic informal evaluations offered by neurosurgeons and others. The system was also well received as a point design by the human-computer interaction community, offering additional encouragement, and my survey of techniques and issues across a broad selection of systems and experiments had suggested some areas worthy of further exploration. However, to make some salient general points about interface design, I felt that my research needed to move beyond the informal approaches used so far, and perform some formal evaluations under experimentally controlled conditions.
My goal was to move beyond point design and to introduce some careful scientific measurement of relevant behavioral principles. Even given that one wants to "evaluate" a system, there are many possible strategies for evaluation that one might choose. The purpose of this chapter is to outline some of the possibilities, to provide a rationale for the research methodology which was ultimately chosen, and to discuss the process for applying that methodology.
5.2 Evaluation with experts versus non-experts
My work has focused on three-dimensional interfaces for neurosurgeons, who are clearly a form of expert user. But what exactly is meant by an expert user? What are some of the issues raised by working with expert users, and evaluating the results of that work? There are at least three different types of experts, plus the category of non-experts, to consider:
Domain experts: These are experts, such as my neurosurgeon collaborators, who have thorough expertise and experience in a particular field or activity. Performing evaluations with domain experts is appropriate when the goal is to develop tools for the domain application and to demonstrate that those tools can improve current practice.
Interface experts: These are expert users who are proficient with a particular computer interface or set of tools. For example, Card reports experiments with word processing experts. Card's goal is to develop "an applied cognitive science of the user" which proposes some models of the human as a cognitive processor. In the context of Card's studies, interface experts are ideal candidates because they exhibit less variable behavior, which is consistent with cognitive skill rather than the searching behavior of novices performing problem solving.
Manipulation experts: These are individuals with great dexterity for skilled manual behaviors, such as painting, sculpting, or playing a violin. Studies of these people and the incredible things they can do with their hands might be helpful for answering questions of how one can build tools that help people to develop comparable skills.1
Non-experts: These are users who may not share common domain knowledge, may have no experience with a task of interest, and will not necessarily have a clear goal in mind with regard to the interface or technology being evaluated. Evaluation with non-experts is appropriate when an artificial goal or task can be introduced and the intent is to see if people can "walk up and use" an interface to accomplish that task. Non-experts are also appropriate for experimental testing of behavioral hypotheses about humans in general.
Neurosurgeons clearly are domain experts, and many neurosurgeons might also be considered manipulation experts because of the fine manual skill required during delicate surgery. Using neurosurgeons for evaluation imposes some constraints on what type of evaluation can be done. Neurosurgeons have heavily constrained schedules, and the available user community of neurosurgeons is quite limited.
5.3 Approaches for evaluation
Given the above constraints, during the planning stages for this research, I considered three general evaluation strategies which might be used2:
Informal usability testing: Demonstrate the interface to domain experts and solicit comments (verbally or through questionnaires) to assess how well the interface meets the task needs of the domain expert. This form of evaluation is essential to develop a useful tool for the domain expert, provides rapid feedback which is well suited to an iterative design process, and is helpful when forming initial hypotheses about factors which can influence the design. Informal usability testing cannot answer general questions as to why an interface might be better than alternative techniques, nor can it address specific experimental hypotheses.
Use by domain experts for real work: The ultimate proof of any tool is for a group of domain experts to use it to achieve goals in the process of their real work. If the domain experts say it is useful, then the tool is declared a success. This approach has been advocated by Fred Brooks. For neurosurgery, the ideal test would be to deploy a tool in the clinical routine and to plan surgical interventions on real patients. This requires development and support of a commercial-quality tool which has been carefully tested for robustness and safety.
Formal Experimentation: Formal experimentation allows careful study of specific hypotheses with non-expert subjects under controlled conditions. Formal experimentation requires introduction of abstract tasks that non-experts can be trained to do quickly and which are suited to the experimental hypotheses, but which may or may not be directly analogous to actual tasks carried out by domain experts.
I decided that my primary goal for this dissertation was to make some general points about interface design and human behavior, so that some of the lessons I had learned in the neurosurgery application could be applied to other interface designs. The formal experimentation strategy best meets the requirements to achieve this goal: ample non-expert subjects are available for experimental testing of hypotheses about human behavior.
Even though formal experimentation is my primary approach, this work as a whole includes elements of all three strategies outlined above. I have performed extensive informal testing with domain experts to drive the interface design itself. Furthermore, although the interface is a research tool and not a clinical tool, it has been tested in the context of actual surgical procedures with real patients, in conjunction with our laboratory's surgical planning software, and Multimedia Medical Systems is currently working to develop a commercial version of the interface for clinical use.
5.4 Principled experimental comparisons
The formal experimentation strategy can only make general points about interface design and human behavior when a principled approach is taken. A careless experimental design is subject to many pitfalls. A pitfall of particular concern when attempting to evaluate and compare user interfaces is known as the A vs. B comparison pitfall. In such evaluations, the purpose is typically to demonstrate that interface A is "superior to" interface B. But unilateral, unqualified statements of this form are almost always meaningless. Interface or input device comparisons should be made in the context of a specific task or set of tasks, and in the context of a specific class of intended users.
Buxton  presents the example of two drawing toys: an Etch-a-Sketch and a Skedoodle. The Etch-a-Sketch has two separate one-degree-of-freedom knobs to control the motion of the stylus, while the Skedoodle has a joystick which allows one to manipulate both stylus degrees-of-freedom simultaneously. The "research question" is this: Which toy has the better interface for drawing? For drawing one's name in cursive script, the Skedoodle excels. But for drawing rectangles, the Etch-a-Sketch is superior. The point is that neither toy is unilaterally "better for drawing," but rather that each style of interaction has its own strengths and weaknesses.
Another related fault of A vs. B comparisons is that they typically offer no insight as to why one interface differs from another. An unprincipled comparison of competing interfaces can easily confound independent experimental factors, making results difficult to interpret or generalize. For example, concluding that "touchscreens are easiest to use" from a comparison of a touchscreen and a mouse confounds the independent factors of absolute versus relative control, direct versus indirect input, device acquisition time, and the required accuracy of selection, among others. One must carefully formulate specific experimental hypotheses or predictions. Actually testing the hypotheses might still involve a carefully controlled comparison of alternative interfaces, but with the goal of testing specific hypotheses in mind, one can design principled evaluations which demonstrate the fundamental mechanisms or human capabilities at work, and thereby suggest new possibilities for design.
5.5 The process of experimental evaluation
There is a general pattern which should be followed in experimental design. It is vital to begin with principled hypotheses upon which to base an experimental comparison. Without some theory to guide the experimental design and the interpretation of experimental data, it is difficult to draw any firm conclusions.
During the initial stages of experimental design, pilot studies are conducted on a small number of subjects. Pilot studies for experiments are much like throw-away prototypes for software systems. The goal is to rapidly discover major surprises or flaws in the concept for the experiment before investing large amounts of time in a more formal study. If one is not getting the expected results, why not? Is the experiment fundamentally flawed, or are there minor problems with the experimental design? Pilot studies also allow one to work out the details of the experiment, such as the specific instructions to give to subjects or the amount of time to allow for each trial. The pilot studies drive modifications and improvements to the experimental design, in an iterative process which may go through several cycles.
The final formal study requires collecting data for a meaningful sample size. The data from pilot studies can be used to calculate an effect size in terms of standard deviation units (the difference between means divided by the standard deviation). The effect size, in turn, can be used to estimate a sample size which will yield sufficient statistical power. Statistical power is the probability that an effect of a given magnitude will be detected with a sample of fixed size, assuming that the effect really exists. If the sample size is too small, statistical power is low and a real effect may go undetected. Thus, the pilot studies also serve to ensure that the final study will not fail to find an effect for lack of statistical power.
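As an illustration of this calculation, the common normal-approximation formula for the per-group sample size of a two-sample comparison can be computed with the Python standard library. The pilot-study numbers below are hypothetical, and a real study might instead use exact t-based power calculations:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample comparison, using the
    normal approximation: n = 2 * ((z_alpha + z_power) / d)^2."""
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # two-sided significance criterion
    z_power = z.inv_cdf(power)            # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Hypothetical pilot data: means of 2.4 s and 2.0 s with a pooled
# standard deviation of 0.5 s give an effect size of 0.8.
effect = (2.4 - 2.0) / 0.5
n = sample_size_per_group(effect)
```

Note how the required sample size grows rapidly as the effect size shrinks, which is why a pilot estimate of the effect size is so valuable before committing to a full study.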
5.6 Data analysis
The formal study provides the data, but data analysis is still necessary to demonstrate the experimental hypotheses. Data analysis requires a careful and thorough exploration of the data. The data analyses in this dissertation use standard analysis of variance (ANOVA) techniques, which perform a linear least squares fit of a model (derived from the experimental hypotheses) to the data.
Linear analysis of variance makes several assumptions about the data, each of which must be checked. For example, the errors (differences between the predicted model and the observed data) should be normally distributed, and the errors should not be correlated with the predicted values. Also, the data must be checked for outliers (such as trials where a computer glitch may have occurred, or the subject might have sneezed) which could unduly bias the analysis. These details of analysis are not addressed in the context of the individual experiments presented here, but the statistical confidences reported are based on a thorough analysis.
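As a concrete sketch of such outlier screening (not the procedure actually used in these analyses), one common robust rule flags trials that lie far from the median in units of the median absolute deviation, so that the outliers themselves cannot inflate the scale estimate and mask the check. The trial times and cutoff here are hypothetical:

```python
from statistics import median

def flag_outliers(times, cutoff=3.0):
    """Flag observations far from the median, scaled by the median
    absolute deviation (MAD), a robust estimate of spread."""
    med = median(times)
    mad = median(abs(t - med) for t in times)
    sigma = mad / 0.6745   # MAD rescaled to estimate the standard deviation
    return [t for t in times if abs(t - med) > cutoff * sigma]

# Hypothetical trial completion times (seconds); the 9.8 s trial might
# reflect a computer glitch or a sneeze, and is flagged for inspection.
trials = [1.1, 1.2, 0.9, 1.0, 1.3, 1.1, 0.8, 1.2, 1.0, 9.8]
suspect = flag_outliers(trials)
```

Flagged trials would be inspected by hand rather than discarded automatically, since a "slow" trial may still be legitimate data.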
All of the experiments described in this thesis use within-subjects designs counterbalanced for order of presentation. This means that each subject performs all experimental conditions, but that the order in which subjects perform the conditions is systematically varied. This is sometimes referred to as a Latin squares design. A Latin squares design helps to ensure that the results will not be biased by order of presentation effects, because order is an explicit between-subjects factor that can be analyzed separately.
For example, imagine that we are designing an experiment that compares condition A with condition B. In a within-subjects design, each subject will perform both condition A and condition B, yielding two groups of subjects: the A-condition-first subjects and the B-condition-first subjects. These groups will be balanced so that half of the subjects try condition B before condition A, while the other half try condition A before condition B. This can help to ensure that any detected difference between condition A and condition B is not entirely due to the order in which subjects performed the conditions: the effects of order are controlled and can be explicitly analyzed.
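A minimal sketch of such a counterbalanced assignment, with hypothetical subject identifiers and condition labels:

```python
def counterbalance(subjects, orders=("AB", "BA")):
    """Assign each subject to a presentation order, cycling through the
    orders so that each order is used equally often (for two conditions,
    this is the two-row Latin square AB / BA)."""
    return {s: orders[i % len(orders)] for i, s in enumerate(subjects)}

assignment = counterbalance(["S1", "S2", "S3", "S4"])
# S1 and S3 perform condition A first; S2 and S4 perform B first.
```

Because each order group is the same size, order of presentation can later be entered into the analysis as an explicit between-subjects factor.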
Figure 5.1 Example Latin Square for two experimental conditions.
Within-subjects designs typically use a repeated measures technique for analysis of variance. Each subject performs multiple conditions, and therefore contributes multiple data observations. But these are not truly separate observations because the observations from a single subject will tend to be correlated: for example, if a subject performs faster than average during one condition, it is more likely that the subject will perform faster for other conditions as well. Repeated measures analysis of variance takes this repetition of observations into account when computing statistical confidence levels.
The statistical results given in my formal experiments make frequent mention of the F statistic and p, the statistical confidence. Both of these statistics are related to testing a model to see how well it predicts a given set of observations. The observed data values Y are the sum of the value Yp predicted by the model plus unexplained error E:

Y = Yp + E
Adding a variable to a model either helps to explain some additional error that was unexplained before, or it has no predictive power, contributing nothing to the model. For statistical analysis, the relevant question is whether or not adding a parameter to a model accounts for more error than one would expect by random chance alone. The F statistic is a ratio which quantifies this. F is the sum of squared error accounted for by adding a new variable to a model divided by the sum of squared error one would expect to account for by adding a random variable to a model. The F statistic can be used to look up p, the statistical confidence for the model. The statistical confidence p is the probability that random selection of N observations from a population resulted in an F statistic of a given magnitude, given the number of parameters (or degrees-of-freedom) of the model. If p is low enough (usually p < .05 is considered sufficient), then the statistical analysis suggests that the F statistic did not result from a deviation caused only by random selection, and therefore one concludes that adding the new variable to the model significantly improves prediction of the observed data, with a confidence level of p.
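This model-comparison logic can be sketched in a few lines of code: fit a mean-only model and a line model to some hypothetical data, then form the ratio of the extra sum of squares explained by adding the slope term to the residual error per degree of freedom. The resulting F would then be referred to an F distribution with (1, n - 2) degrees of freedom to obtain p; the data values here are made up for illustration:

```python
def f_for_added_slope(x, y):
    """F ratio for adding a slope term to a mean-only model:
    F = (extra SS explained by the slope / 1) / (residual SS / (n - 2))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx                       # least-squares fit of the line
    intercept = my - slope * mx
    ss_null = sum((yi - my) ** 2 for yi in y)             # mean-only model
    ss_full = sum((yi - (intercept + slope * xi)) ** 2    # line model
                  for xi, yi in zip(x, y))
    return ((ss_null - ss_full) / 1) / (ss_full / (n - 2))

F = f_for_added_slope([0, 1, 2, 3], [1, 3, 2, 5])
```

If the slope explains no more error than a random variable would, F will be near 1; large values of F (and correspondingly small p) indicate that the added term genuinely improves prediction.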
This chapter has discussed the issues of working with expert domain users and has presented the rationale and process for the experimental work described in subsequent chapters. Each evaluation technique has its own advantages and disadvantages. For the purpose of extracting general design principles and knowledge about human behavior, evaluation with non-experts is a suitable strategy. This thesis combines elements of all three strategies: informal usability testing, use by domain experts for real work, and formal experimentation, providing a synergy of real-world application and careful scientific measurement. The remaining chapters will now focus on the formal experiments.
1. Bill Verplank of Interval Research suggested the concept of "manipulation experts" during the ACM CHI'96 Workshop on Virtual Manipulation in April 1996.
2. I would like to acknowledge personal communication with Rob Jacob which discussed these approaches.
Copyright © 1996, Ken Hinckley. All rights reserved.