2007 Oct 20 at 10:00
MC 5158
Richard Karp, Dept. of Electrical Engineering & Comp. Sci., Univ. California at Berkeley
Companies such as Yahoo, Google and Microsoft maintain extremely large data repositories within which searches are frequently conducted. There is interest in developing such repositories in the public sector and applying them to massive data set problems of importance to the scientific community and society in general.
It is of interest to develop streaming algorithms for basic information processing tasks within such data repositories. We present such algorithms for selecting the keys of given ranks in a totally ordered set of n keys, and for a related problem of approximate sorting. We derive bounds on the storage and time requirements of these algorithms under the assumption that blocks of keys arrive in a random order, and show that these bounds are close to information theoretic lower bounds for the problems. We assume random arrivals because these repositories support random access to disc blocks. The alpha-quantile of a totally ordered set of n keys is the (alpha n)th smallest element. We present near-optimal algorithms (simultaneously for time and storage), under the random arrivals assumption, for the following problems:
1. Selection: compute the alpha-quantile for a given alpha;.
2. Multiple selection: compute alpha-quantiles for many given values of alpha;.
3. Parallel selection, in which the input is divided into streams, each with its own buffer, and the different streams communicate by message passing.
4. Approximate selection: given alpha and epsilon;, find a key whose rank differs from (alpha n) by at most (epsilon n).
5. Approximate sorting: Given a small positive constant epsilon;, compute an ordering of the keys in which the rank assigned to each key agrees with its rank in the true ordering, within a relative error of epsilon.
Finally, as a byproduct of our analysis of approximate sorting, we give an elegant method for computing the expected number of comparisons for some classical randomized algorithms for selection and selection.
Note: Dr Karp will be receiving an honourary D Math from the University of Waterloo in the afternoon convocation of the same day
BIO:
One of the world's most eminent scholars in theoretical computer science, Richard Karp received a PhD in Mathematics from Harvard University in 1959. He was a member of the Mathematical Sciences Department at IBM Research in Yorktown, New York until 1968, at which time he moved to the University of California at Berkeley. He has been there ever since (except for the period 1995-1999 when he was at the University of Washington). Dr. Karp currently holds the post of University Professor in the Department of Electrical Engineering and Computer Science, with additional appointments in Mathematics, Bioengineering, and Operations Research. Among numerous awards and medals he has won, perhaps most notable is the Turing Award (1985), the most prestigious award in computer science, awarded annually by the Association for Computing Machinery for contributions considered of lasting and major technical importance to the computing community.