Lexical Semantics Toolkit & Dataset

The goal of this project is to provide easy-to-use models of lexical semantic relations developed at Microsoft Research. The models currently include heterogeneous vector space models for measuring semantic word relatedness and the polarity-inducing latent semantic analysis (LSA) model, which judges whether two words are synonyms or antonyms.


This dataset contains word vectors and word lists for a variety of lexical semantic models. Measuring whether two words have a particular relation is typically straightforward: compute the cosine score of the corresponding vectors. Information on how these vectors are created can be found in [Yih & Qazvinian, NAACL-HLT-2012] and [Yih, Zweig & Platt, EMNLP-CoNLL-2012]. Please refer to the ReadMe file for more details about this dataset.
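
The cosine score mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's own code: the `cosine` function, the `vectors` dictionary, and the three-dimensional toy vectors are all invented for this example, and loading the real vectors is left out because the file layout is described in the ReadMe rather than here.

```python
import math


def cosine(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an all-zero vector scores 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration; they are NOT taken from the dataset.
vectors = {
    "hot": [0.9, 0.1, 0.3],
    "warm": [0.8, 0.2, 0.4],
    "cold": [-0.9, -0.1, -0.2],
}

print("hot/warm:", cosine(vectors["hot"], vectors["warm"]))
print("hot/cold:", cosine(vectors["hot"], vectors["cold"]))
```

How a score should be read depends on the model. For the relatedness vectors, a higher cosine suggests more closely related words. For the polarity-inducing LSA vectors, the sign of the cosine is presumably what separates synonyms from antonyms; see [Yih, Zweig & Platt, EMNLP-CoNLL-2012] for the exact interpretation.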

  • Wen-tau Yih, Geoffrey Zweig, and John Platt, Polarity Inducing Latent Semantic Analysis, in Proceedings of Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Association for Computational Linguistics, 12 July 2012
  • Wen-tau Yih and Vahed Qazvinian, Measuring Word Relatedness Using Heterogeneous Vector Space Models, in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2012), Association for Computational Linguistics, 2 June 2012