E0 229 Lectures

Lecture 1 (7/1): Introduction to the course (Ravi)

Lecture 2 (9/1): High Dimensional Spaces (Ramesh) 

Lecture 3 (16/1): High Dimensional Spaces (Ramesh) slides (for Lectures 2 and 3)

Lecture 4 (21/1): High Dimensional Spaces (Ramesh)

Lecture 5 (23/1): High Dimensional Spaces (Ramesh) slides (for Lecture 4 and 5)

Lectures 6-9: SVD (Ravi) scribe notes 1, notes 2

Clustering (Ravi) slides 1, slides 2, slides 3, slides 4

Random walks (Ramesh) slides 1, slides 2

Streaming algorithms (Navin)

slides 1 (counting in loglog space; median of the means estimator; streaming models) 

slides 2 (second frequency moment via AMS sketch; universal hashing)

Optional Links: AMS sketch comes from this paper of Alon--Matias--Szegedy. It's a foundational paper for the modern study of data streams. It discusses both upper and lower bounds. A relatively recent discussion of the state of the art on several streaming problems can be found in the introduction of this paper due to Andoni et al. 2011.

slides 3 (counting distinct elements)

Optional Links: Our second algorithm for counting distinct elements is from this paper due to Bar-Yossef et al. 2002. Our algorithm for counting distinct element in turnstile streams is from these slides of Indyk. And this paper due to Kane et al. 2010 has an optimal algorithm for counting distinct elements for quite general streams.

slides 4 (document sketching; priority sampling)

Optional Links: The original paper of Broder where the notion of document resemblance was proposed along with the permutation based estimator. The Wikipedia page on MinHash is a good source of information on this topic. Priority sampling was introduced in this paper of Duffield et al. This paper of Thorup shows that bottom-k sampling can be done using pairwise independent hashing.

slides 5 (priority sampling; point queries, frequent elements)

Optional Links: Count-min sketch is from this paper due to Cormode and Muthukrishnan.

We're covering only some of the basic aspects of streaming algorithms. There are a number of resources on the web covering much more ground. See, for example, here for a recent course devoted to the topic; it also links to several other courses with notes available online. If you want to attempt open problems in this area, you can find some problems here (some of these have been solved though). And, on a different note, here's a data science course covering practical aspects with almost disjoint content from our course.