Representativeness in Software Engineering Research

MSR-TR-2012-93

One of the goals of software engineering research is to achieve generality: Are the phenomena found in a few projects reflective of what goes on in others? Will a technique benefit more than just the projects it is evaluated on? Our community has gained rigor over the past twenty years and now attempts to achieve generality by evaluating and studying an increasing number of software projects (sometimes hundreds!). However, quantity is not the only important component. Selecting projects that are representative of a larger body of software of interest is just as critical. Little attention has been paid to selecting projects in such a way that generality and representativeness are maximized, or even quantitatively characterized and reported. In this paper, we present a general technique for quantifying how representative a sample of software projects is of a population across many dimensions. We also present a greedy algorithm for choosing a maximally representative sample. We demonstrate our technique on research presented over the past two years at ICSE and FSE with respect to a population of 20,000 active open source projects. Finally, we propose methods of reporting objective measures of representativeness in research.
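The abstract names the two ingredients (a representativeness score and a greedy selection) but not their definitions. The following is only a minimal sketch under stated assumptions, not the report's actual method: it assumes a project "covers" every population project that is similar to it in all dimensions, scores a sample by the fraction of the population it covers, and picks projects greedily by marginal coverage gain. The names `similar`, `coverage`, `greedy_sample`, and the `tolerance` parameter are hypothetical and chosen for illustration.

```python
from typing import Callable, Sequence

# A project is described by numeric dimensions, e.g. {"loc": 1000, "devs": 2}.
Project = dict[str, float]


def similar(a: Project, b: Project, tolerance: float = 0.5) -> bool:
    """Assumed similarity rule: b is within a relative tolerance of a in every dimension."""
    return all(
        abs(a[k] - b[k]) <= tolerance * max(abs(a[k]), abs(b[k])) for k in a
    )


def coverage(
    sample: Sequence[Project],
    population: Sequence[Project],
    sim: Callable[[Project, Project], bool] = similar,
) -> float:
    """Fraction of the population similar to at least one sampled project."""
    if not population:
        return 0.0
    covered = sum(any(sim(s, p) for s in sample) for p in population)
    return covered / len(population)


def greedy_sample(
    population: Sequence[Project],
    k: int,
    sim: Callable[[Project, Project], bool] = similar,
) -> list[int]:
    """Pick up to k population indices, each time adding the project that
    covers the most not-yet-covered projects (greedy set-cover heuristic)."""
    # Precompute which population indices each project covers.
    covers = [
        {j for j, q in enumerate(population) if sim(p, q)} for p in population
    ]
    chosen: list[int] = []
    covered: set[int] = set()
    for _ in range(min(k, len(population))):
        candidates = [i for i in range(len(population)) if i not in chosen]
        best = max(candidates, key=lambda i: len(covers[i] - covered), default=None)
        if best is None or not (covers[best] - covered):
            break  # no remaining project adds coverage
        chosen.append(best)
        covered |= covers[best]
    return chosen
```

Under these assumptions, a sample's score can be reported alongside the dimensions and similarity rule used, for example "the sample covers 60% of the population with respect to size and team size". The greedy choice is a heuristic: it maximizes the gain at each step and does not guarantee the best possible sample of size k.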