|
|
Alejandro (Alex) López-OrtizAssociate
Professor
|
|
The exponential growth of computing has had as a
consequence a large increase in the scale of the problems addressed by computer
scientists. For example, twenty years ago a few hundreds of megabytes
constituted a substantial text corpus. By the early 1990's, efficiently
handling of a few gigabytes was a reasonable goal. Nowadays search engines
index within a corporation measure data in terabytes of data for specialized
systems and are rapidly moving toward this figure for general business
practice. In other instances, we have seen similar growth in the size of large
data collections such as those derived from internet traffic management,
genomic data, and sensor networks. My research has focused in areas of
application–such as these–in which the impact of an improved
theoretical solution can be most felt. For example, in the context of
constraint propagation [1] (see Constraint Programming below) we proposed a
novel linear time algorithm that improved on the best known Ө(n2)
algorithm used in practice till then. Experimentally we determined the leading
constant of the Ө(n) speedup factor to be 1/2.
This means that in a problem of size 100,000 the time required to solve it went
down from about a year of computation to approximately ten minutes.
My research in general has as a goal the study of
real life applications that (i) are amenable to
theoretical analysis and (ii) lead to deeper understanding of the nature of
computation and information processing. Here’s a selection of the
problems I do research on (for a full list see my list of publications):
We study the effects of caching int the network. 1996 we proposed a push-cache protocol for
HTTP based on geographic and topological configurations. The simulations
presented were later validated when similar schemes were implemented by content
distribution networks (such as Akamai) on the public
Internet.
Another area of study is the introduction of
performance based routing using an overlay network over the public, BGP-based,
Internet infrastructure. This work has various aspects, including Internet
tomography and traffic balancing, and was incorporated into Internap
Network Services routing tool AS/3 (lead developer: O. Stiffelman).
The WWW, corporate emails or government legislation and public policy documents are typical examples of vast text collections. Aside from text, large data collections are also generated by genomic analysis, trading, sensors, electronic transactions and other such. Our goal is to propose improved methods for users to effectively search these collections and collate the results obtained. In particular we study algorithms for fast composition of a result set from the document sets corresponding to query terms. Faster algorithms also free up resources to be used for more intelligent processing of the result set. These sets are typically several million entries in size. We revisited algorithms for set operations and proposed an adaptive algorithm and shows that its running time is within a constant factor of optimal with respect to a natural measure of the difficulty of an instance.
Another problem confronted by search engine
designers is the sheer size of the index structure. If the text is available it
is in fact possible to do away with the index altogether, at a penalty in
performance, which becomes linear over the size of the corpus (like grep). In this case we showed that any algorithm with
performance linear over the size of the pattern (indeed subquadratic)
requires a linear size index, even if the text is available.
On-line
analysis is the study of algorithms that must make irrevocable decisions in the
presence of partial knowledge on the input. This framework can model a variety
of practical settings. A mobile robot for example, often navigates a changing,
partially known environment, such as a robot moving supplies in a shop floor
from the warehouse to milling machines. Initially my research focused on
solving special cases of this problem, then proceeded to extend the basic
search primitives so that they apply to more general settings. One of these
primitives is the linear line search problem which consists of a robot
exploring to semi-infinite rays. This problem can be generalized to the
exploration of m-rays by a single
robot, and of p robots exploring m-rays. Kao et al. observed that these
geometric search primitives apply to non-geometric searches, such as a set of
computer heuristics searching for the solution to a computationally hard
problem. Based on this idea we generalized the results of p robots and m rays to
optimally solve the case of m computers
working simultaneously on n problems
in the context of contract algorithms.
In terms of other on-line problems we also
considered the effects of caching effects in the memory hierarchy. The
cache-oblivious model allows programmers to reason about a two-level memory
hierarchy while proving results that hold on a multilevel memory hierarchy. We
found exact bounds on the time required for searching on the cache-oblivious
model. Our new search structure showed that in certain settings uneven splits
in a divide-and-conquer paradigm can, surprisingly, yield better performance.
In future research we will study the applicability of these ideas to other
settings such as cache-oblivious matrix multiplication.
It has long been known in the field of online
algorithms that the standard measure of performance, namely the competitive
ratio, in some cases leads to results that do not properly reflect the real
life performance of the strategy in question. Many researchers have addressed
this question as it relates to paging with partial degrees of success. In this
paper, building upon the work of Albers et al. we introduced a natural measure
of complexity that fully reflects the empirical separation of LRU (Least
Recently Used) from all other memory paging strategies. This is the first model
in which strict separation between LRU and all other strategies can be proven
thus settling a long standing conjecture. The model extends to other on-line
settings, which will be the subject of forthcoming papers.
The combined study of these various on-line
problems brought about a reconsideration of the competitive ratio under an
adversary model as the main technique of analysis. While this is the most
common metric in on-line algorithm analysis, it produces pessimistic measures
for certain types of problems, including paging algorithms and general
polygonal exploration. This is an observation shared by several other
researchers (see our survey on the subject). The shortcomings of the
competitive ratio led us to rethink the entire foundations of competitive
analysis. A further generalization to other settings, which we have termed the cooperative ratio, also seems a good
candidate to produce measurements of online algorithms that better reflect
actual practice. We are optimistic about the outcome of this research.
Constraint
programming aims to solve computationally intractable problems within the best
(exponential) time possible. In particular constraint propagation aims to
assist an exhaustive search by detecting as early as possible when a partial
solution to a problem is no longer self-consistent. This has been the subject
of intensive study in the Artificial Intelligence community and its results are
widely used in practice (e.g. the ILOG solver). The alldifferent
constraint is an important scheduling p