Machine Learning Day 2013 - Clustering: Probably Approximately Useless; Geometry Preserving Non-Linear Dimension Reduction

Speaker  Rich Caruana and Marina Meila

Host  Ofer Dekel

Affiliation  Microsoft Research-Redmond, University of Washington

Duration  01:06:05

Date recorded  18 October 2013

Clustering: Probably Approximately Useless?, Rich Caruana (MSR)

Clustering never seems to live up to the hype. To paraphrase the popular saying, clustering looks good in theory, yet often fails to deliver in practice. Why? You would think that something so simple and elegant as finding groups of similar items in data would be incredibly useful. Yet often it isn't. The problem is that clustering rarely finds the groups *you* want, or expected, or that are most useful for the task at hand. There are so many good ways to cluster a dataset that the odds of coming up with the clustering that is best for what you're doing are small. How do we fix this and make clustering more useful in practice? How do we make clustering do what you want, while still giving it the freedom to "do its own thing" and surprise us?

Geometry preserving non-linear dimension reduction, Marina Meila (UW Statistics)

In recent years, manifold learning has become increasingly popular as a tool for performing non-linear dimensionality reduction. This has led to the development of numerous algorithms of varying degrees of complexity that aim to recover the underlying low-dimensional parameters of the data using either local or global features.

It is also widely recognized that the low dimensional parametrizations will typically distort the geometric properties of the original data, like distances, angles, areas and so on. These distortions depend both on the data and on the algorithm used.

Building on the Laplacian Eigenmap framework, we propose a new paradigm that offers a guarantee, under reasonable assumptions, that *any* manifold learning algorithm will preserve the geometry of a data set. Our approach is based on augmenting the output of an algorithm with geometric information, embodied in the Riemannian metric of the manifold. The Riemannian metric allows us to compute geometric quantities (such as angle, length, or volume) for any coordinate system or embedding of the manifold. This geometric faithfulness, which is not guarante edfor most algorithms, allows us to define geometric measurements that are inde pendent of the algorithm used, and hence move seamlessly from one algorithm to another. In this work, we provide an algorithm for estimating the Riemannian metric from data and demonstrate the advantages of our approach in a variety of examples.

As an application of this new framework, we develop a new, principled, unsupervised to selecting the scale parameter in manifold learning, based on optimizing the geometric self-consistency w.r.t the scale.

This talk will not require any knowledge of advanced mathematics or manifold learning.

Joint work with Dominique Perrault-Joncas.