|
|
|||
The RankDB Project is working on efficiently supporting top-k and preference queries in Database Systems. We are interested in the various aspects of the problem including preference and rank specification, rank-aware data models, integrating ranking in practical query processors and optimizers, and adaptive processing of top-k queries in non-traditional environment.
Overview [top]Despite their robust and efficient data management capabilities, current database systems lack efficient means of handling of IR-style queries, which are dominant in many emerging applications. These types of queries pose several challenges to the design and implementation of database engines, especially with respect to query processing and optimization. For example, applications in multimedia and web databases have notions of proximity, ranking, and user feedback that have serious impact on the query model, which cannot be adequately handled by current query processing approaches. Two of the data management challenges posed by these emerging applications are the following:
RankDB focuses on a number of research directions to address these challenges: (1) enabling ranked retrieval in current relational query processors as a first-class functionality; (2) extending rank-aware query processing beyond relational data models to text and semi-structured databases, which are becoming an integral part of current applications infrastructure; and (3) introducing new rank-aware query processing paradigms that offer a high degree of adaptability and scalability. Example 1: Multimedia Retrieval by Contents [top]In a video database system (VDBMS), users provide an example image and request the video objects (segments or snapshots) most similar to that image. The similarity is measured with respect to several extracted visual features, e.g., the color histograms, the color layout, the texture or the orientation of the edges. Similarity queries are well studied and can give the most similar objects according to each feature separately. A major challenge in all retrieval-by-contents applications is how to combine several rankings into one global ranking of database objects.
The Query Image
The top 5 video shots w.r.t the three features combined using Rank-join
Figure 1: A multi-feature query in a multimedia retrieval by contents
Example 2: Ranking over External Sources [top]A user is interested in finding the most economic relocation choices by querying two available databases of Houses and Schools. The objective is to get the location with the lowest expenses, where the total expenses are computed as the sum of the price of the house and the school tuition for ten years. The simplest way to put this query in pseudo-SQL constructs is as follows: SELECT H.id, S.name Traditionally, the only way to answer this query is to compute all the join results (house-school pairs that are in the same location), then to sort the results according to the total cost. The solution is prohibitively expensive if the databases of houses and schools are very large or if the Houses and Schools are web databases or from external sources. Alternatively, if we can access the inputs from the two databases in a sorted access (ranked by house price and school tuition separately), my proposed rank-join algorithm produces the top k join results without consuming all the inputs. The performance benefit can be orders of magnitude over the traditional solution
In general, we focus on efficiently answering the general form of a top-k query given by the following query template, Q: Q: SELECT some attributes where F is a monotone function that combines the scores of individual inputs R1 .. Rm into one score for each join result.
The Rank-join Algorithm [top]For simplicity, the algorithm is described for the binary case. Given two ranked inputs and a join condition, the rank-join algorithm produces the top k join results as soon as possible and without consuming all the inputs. The algorithm maintains a state which consists of a priority queue for buffering all join results that cannot yet be produced. The algorithm also maintains a threshold value that upper-bounds the scores of all "unseen" join results. Using the buffer and the threshold, the algorithm can decide if one of the produced join results is guaranteed to be among the top k answers. The rank-join algorithm is incremental in the sense that it reports the next top ranked result for each call of the algorithm. The main idea is to go in parallel (not necessarily in lock-step) in the inputs retrieving the next ranked tuple from each input and to join that tuple with all tuples seen so far. Figure 2 gives an example execution of the algorithm for two inputs, L and R. The join condition is L.A = R.A. At some instance in the algorithm execution, the scores of the last tuple retrieved from the L and R are L_bottom and R_bottom, respectively. The scores of the top ranked tuples in L and R are L_top and R_top respectively. The threshold is computed as T = Max (F(L_top,R_bottom) , F(L_bottom,R_top)) where F is the score combining function. In this example if F is the sum of scores, T = Max(100+3, 25+5) = 103. The join result J:[(1,1,100),(2,1,4)] is a valid join result that joins the first ranked tuple in L with the second ranked tuple in R and has the score of 100+4 = 104. Hence, J can be reported as the top ranked join result without continuing retrieving tuples from inputs, which can be arbitrary large.
Figure2: The Rank-join Algorithm - the Binary Case
Rank-join Query Operators [top]To support rank aggregation algorithms in a database
system, we have the choice of implementing these algorithms at the
application level as user-defined functions or to implement them as core
query operators (rank-join operators). Although the latter approach
requires more effort in changing
Rank-join operators are almost non-blocking. The next
ranked join result is usually produced in a pipelined fashion without
the need to exhaust all the inputs. On the other hand, a rank-join
operator may need to exhaust part of the inputs before being able to
report the next ranked join result. We proved that rank-join operators
can achieve a huge benefit over the traditional join-then-sort approach
to answer top-k join queries especially for small values of k. We introduced two possible rank-join implementations: nested-loops rank-join (NRJN) and hash rank-join (HRJN). For any rank-join operator, the operator is initialized by specifying the two inputs, the join condition, and the combining function. HRJN can be viewed as a variant of the symmetrical hash join algorithm or the hash ripple join algorithm. HRJN maintains an internal state that consists of three structures. The first two structures are two hash tables, i.e., one for each input. The hash tables hold input tuples seen so far and are used in order to compute the valid join results. The third structure is a priority queue that holds the valid join combinations ordered on their combined score. NRJN is similar to HRJN except that it follows a nested-loops strategy to generate valid join results. NRJN internal states contains only a priority queue of all seen join combinations. Similar to HRJN, NRJN maintains a threshold that upper-bounds the scores of all unseen join results. The depth of a rank-join operator is defined as the number of input tuples needed to produce top-k join results (a) (b) Figure3: Pushing the Sorting into the Join in the Rank-join Algorithm Rank-aware Query Optimization [top] We introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. We investigated two different ways to achieve this task: (1) Extending Join Space Through Interesting Ranking Properties The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators. We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator.
Figure4: Interesting Ranking Expressions Figure 4 (a) gives the MEMO structure for the traditional application of the Dynamic Programming enumeration without the proposed extension. For example, we keep two plans for Table A; the cheapest plan that does not have any order property (DC) and the cheapest plan that produces results ordered on A.c2 as an interesting order. Figure 4 (b) shows the newly generated classes of plans that preserve the required ranking. For each interesting order expression, the cheapest plan that produces that order is retained. For example, in generating plans that join Tables A and B, we keep the cheapest plan that produces results ordered on 0.3*A.c1 + 0.3*B.c1. (2) Two-Dimensional Space Enumeration With Sampling-based Statistics We take a principled way to extend DP plan enumeration by treating ranking predicates as another dimension of enumeration in addition to Boolean predicates, based on the insight that the ranking (order) relationship is another logical property of data, parallel to membership. Recall that, ranked relation essentially possesses two logical properties: Boolean membership, and ranking order. In a ranking query plan, new ranking predicates are only introduced in operators. Therefore the predicate set of a sub-plan, determines the order, just like how join conditions (together with other operations) determine the membership. To enable rank-aware query processing and optimization, we extend relational algebra to be rank-relational algebra, where the relations, operators, and algebraic laws respect and take advantage of the essential concept of rank. We define rank relational model and extend relational algebra. The new rank-relational algebra enables and determines our query execution model and operator implementations. We also introduce several laws of the new algebra to lay the foundation of query optimization. (1) Rank-Relation
(2) Rank-Relational Operators (3) Algebraic Equivalence Laws: |
||||
Copyright © 2005 by the University of Waterloo. All rights reserved. |