We propose a procedure for sampling representative phrases from any large corpus, so that text input researchers can use publicly accessible resources to curate their own stimuli for the tasks, domains, and languages they wish to target. The procedure grounds the notion of representativeness in information theory. Here, you can read the paper and download the code and data.
- Tim Paek and Bo-June Paul Hsu, Sampling Representative Phrase Sets for Text Entry Experiments: A Procedure and Public Resource, in CHI 2011, ACM, 7 May 2011.
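As a rough illustration of what an information-theoretic notion of representativeness can look like, the sketch below scores a candidate phrase set by the relative entropy (KL divergence) between its character-bigram distribution and that of the full corpus, and keeps the lowest-scoring of several random samples. This is only a sketch under assumptions: the choice of character bigrams, add-one smoothing, and random restarts, as well as the function names `char_bigram_counts`, `kl_divergence`, and `most_representative_set`, are hypothetical and are not taken from the downloadable code. Please refer to the paper for the actual procedure.

```python
import math
import random
from collections import Counter


def char_bigram_counts(phrases):
    """Count character bigrams over a list of phrases (lowercased)."""
    counts = Counter()
    for phrase in phrases:
        text = phrase.lower()
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    return counts


def kl_divergence(sample_counts, corpus_counts):
    """D(sample || corpus) in bits, with add-one smoothing on the corpus side.

    Smoothing is an assumption of this sketch; it keeps the corpus
    probability nonzero for every bigram seen in the sample.
    """
    vocab = set(sample_counts) | set(corpus_counts)
    sample_total = sum(sample_counts.values())
    corpus_total = sum(corpus_counts.values()) + len(vocab)
    divergence = 0.0
    for bigram, count in sample_counts.items():
        p = count / sample_total
        q = (corpus_counts[bigram] + 1) / corpus_total
        divergence += p * math.log2(p / q)
    return divergence


def most_representative_set(corpus, set_size, trials, seed=0):
    """Draw `trials` random phrase sets and keep the one closest to the corpus."""
    rng = random.Random(seed)
    corpus_counts = char_bigram_counts(corpus)
    best_set, best_score = None, math.inf
    for _ in range(trials):
        candidate = rng.sample(corpus, set_size)
        score = kl_divergence(char_bigram_counts(candidate), corpus_counts)
        if score < best_score:
            best_set, best_score = candidate, score
    return best_set, best_score
```

For example, `most_representative_set(phrases, set_size=500, trials=1000)` would return one 500-phrase set together with its divergence from the corpus, where `phrases` is a list of strings you have loaded yourself. A lower score means the set's bigram statistics are closer to those of the corpus.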
If you have any questions about the downloads, please feel free to contact us. Furthermore, if there are datasets for which you would like to obtain representative phrase sets (e.g., general web, Wikipedia, etc.), please let us know as well. We will use this project page to post requested stimuli. Thank you for your interest.