Finding duplicates in a data stream

Given a data stream of length $n$ over an alphabet $[m]$ where $n > m$, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses $O((\log m)^3)$ space. This answers a question of Muthukrishnan and Tarui, who asked if this problem could be solved using sub-linear space and one pass over the input. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a Dictatorial variable in a Boolean halfspace.

PDF file

In  SODA'09

Publisher  The British Computer Society


> Publications > Finding duplicates in a data stream